Age         Zipcode       Disease
[21, 27]    [10k, 18k]    flu
[21, 27]    [10k, 18k]    dyspepsia
[32, 54]    [35k, 60k]    gastritis
[32, 54]    [35k, 60k]    bronchitis
[32, 54]    [35k, 60k]    flu
60          63000         dyspepsia
60          63000         diabetes
60          63000         gastritis


Name    Age    Zipcode    Disease
Ann     21     10000      flu
Bob     27     18000      dyspepsia
Gate    32     35000      gastritis
Don     32     35000      bronchitis
Ed      54     60000      flu
Fred    60     63000      dyspepsia
Gill    60     63000      diabetes
Hera    60     63000      gastritis


Age         Zipcode       Disease
[21, 32]    [10k, 22k]    dyspepsia
[21, 32]    [10k, 22k]    gastritis
[27, 36]    [18k, 37k]    flu
[27, 36]    [18k, 37k]    gastritis
[54, 60]    [60k, 63k]    bronchitis
[54, 60]    [60k, 63k]    flu
60          63000         dyspepsia
60          63000         diabetes

Numerous generalization techniques have been proposed for privacy preserving data publishing.
Most existing techniques, however, implicitly assume that the adversary knows little about the
anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against
privacy attacks that exploit various characteristics of the anonymization mechanism. This paper
provides a practical solution to the above problem. First, we propose an analytical model for
evaluating disclosure risks when an adversary knows everything in the anonymization process
except the sensitive values. Based on this model, we develop a privacy principle, transparent
l-diversity, which ensures privacy protection against such powerful adversaries. We identify three
algorithms that achieve transparent l-diversity, and verify their effectiveness and efficiency through
extensive experiments with real data.

1. INTRODUCTION

Privacy protection is highly important in the publication of sensitive personal
information (referred to as microdata), such as census data and medical records. A
common practice in anonymization is to remove the identifiers (e.g., social security
numbers or names) that uniquely determine entities of interest. This, however, is
not sufficient because an adversary may utilize the remaining attributes to identify
individuals [Samarati 2001]. For instance, consider that a hospital publishes the
microdata in Table I, without disclosing the patient names. Utilizing the publicly
accessible voter registration list in Table II, an adversary can still discover Ann's
disease, by joining Tables I and II. The joining attributes {Age, Zipcode} are called
the quasi-identifiers (QI).

Name    Age    Zipcode    Disease
Ann     21     10000      dyspepsia
Bob     27     18000      flu
Gate    32     35000      gastritis
Don     32     35000      bronchitis
Ed      54     60000      gastritis
Fred    60     63000      flu
Gill    60     63000      dyspepsia
Hera    60     63000      diabetes

Table I. Microdata T1

Name     Age    Zipcode
Ann      21     10000
Bob      27     18000
Bruce    29     19000
Gate     32     35000
Don      32     35000
Ed       54     60000
Fred     60     63000
Gill     60     63000
Hera     60     63000

Table II. Voter List E1

Age         Zipcode       Disease
[21, 27]    [10k, 18k]    dyspepsia
[21, 27]    [10k, 18k]    flu
32          35000         gastritis
32          35000         bronchitis
[54, 60]    [60k, 63k]    gastritis
[54, 60]    [60k, 63k]    flu
[54, 60]    [60k, 63k]    dyspepsia
[54, 60]    [60k, 63k]    diabetes

Table III. Generalization T2*
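
The joining step described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the function name `link` and the tuple layout are assumptions, and the rows are transcribed from Tables I and II.

```python
# Table I as the hospital would publish it: names removed, QI values kept.
published = [
    (21, 10000, "dyspepsia"), (27, 18000, "flu"),
    (32, 35000, "gastritis"), (32, 35000, "bronchitis"),
    (54, 60000, "gastritis"), (60, 63000, "flu"),
    (60, 63000, "dyspepsia"), (60, 63000, "diabetes"),
]
# Voter registration list from Table II.
voters = [
    ("Ann", 21, 10000), ("Bob", 27, 18000), ("Bruce", 29, 19000),
    ("Gate", 32, 35000), ("Don", 32, 35000), ("Ed", 54, 60000),
    ("Fred", 60, 63000), ("Gill", 60, 63000), ("Hera", 60, 63000),
]

def link(published_rows, voter_rows):
    """Join on (Age, Zipcode) and return each voter's candidate diseases."""
    diseases_by_qi = {}
    for age, zipcode, disease in published_rows:
        diseases_by_qi.setdefault((age, zipcode), set()).add(disease)
    return {name: diseases_by_qi.get((age, zipcode), set())
            for name, age, zipcode in voter_rows}

candidates = link(published, voters)
# A voter with exactly one candidate disease has been re-identified.
exposed = sorted(name for name, ds in candidates.items() if len(ds) == 1)
```

Voters whose QI values are unique in the published table end up with a single candidate, which is exactly the disclosure that generalization is meant to prevent.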

Generalization [Samarati 2001] is a popular solution to the above problem. It
works by first assigning tuples to QI-groups, and then transforming the QI values in
each group to an identical form. As an example, Table III illustrates a generalized
version of Table I with three QI-groups. Specifically, the first, second, and third
QI-groups contain the tuples {Ann, Bob}, {Gate, Don}, and {Ed, Fred, Gill, Hera},
respectively. Even with the voter registration list in Table II, an adversary still
cannot decide whether Ann owns the first or second tuple in Table III, i.e., Ann's
disease cannot be inferred with absolute certainty.

Generalizations can be divided into global recoding and local recoding [LeFevre
et al. 2005]. The former demands that if two tuples have identical QI values, they
must be generalized to the same QI-group. Without this constraint, the generalization
is said to use local recoding. For instance, Table III obeys global recoding.
Notice that Gate and Don have equivalent QI values in the microdata (Table I),
and therefore must be included in the same QI-group. This is also true for Fred,
Gill, and Hera.

The privacy-preservation power of generalization relies on the underlying privacy
principle, which determines what is a publishable QI-group. Numerous principles
are available in the literature, offering different degrees of privacy protection. One
popular, intuitive and effective principle is l-diversity [Machanavajjhala et al. 2007].
It requires that, in each QI-group, at most 1/l of the tuples can have the same
sensitive value^1. This ensures that an adversary can have at most 1/l confidence
in inferring the sensitive information of an individual. For example, Table III is
2-diverse. Thus, an adversary can discover the disease of a person with at most
50% probability.
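
As a minimal sketch of the frequency-based requirement just stated, the following Python function checks whether a QI-group is l-diverse. The function name and the choice to treat an empty group as not diverse are assumptions made for illustration; the groups are the Disease columns of Table III.

```python
from collections import Counter

def is_l_diverse(sensitive_values, l):
    """True if at most |G|/l tuples in the group share one sensitive value."""
    if not sensitive_values:
        return False  # assumption: an empty group is treated as not publishable
    most_frequent = max(Counter(sensitive_values).values())
    return most_frequent * l <= len(sensitive_values)

# Disease values of the three QI-groups in Table III.
table_iii_groups = [
    ["dyspepsia", "flu"],
    ["gastritis", "bronchitis"],
    ["gastritis", "flu", "dyspepsia", "diabetes"],
]
table_iii_is_2_diverse = all(is_l_diverse(g, 2) for g in table_iii_groups)
```

The comparison is written as `most_frequent * l <= len(...)` to stay in integer arithmetic and avoid rounding issues with `len(...) / l`.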

Interestingly, none of the existing privacy principles (except those in [Wong et al.
2007] and [Zhang et al. 2007]) specifies any requirement on the algorithm that
produces the generalized tables. Instead, they impose constraints only on the formation
of the QI-groups (as l-diversity does), which, unfortunately, leaves open the
opportunity for an adversary to breach privacy by exploiting the characteristics of the
generalization algorithm. This problem was first pointed out by Wong et al. [2007],
who demonstrate a minimality attack^2 that (i) can compromise most existing
generalization techniques, and (ii) requires only a small amount of knowledge about the
generalization algorithm. As a solution, they propose an anonymization approach
that can guard against minimality attacks.

^1 There also exist other formulations of l-diversity, as will be discussed in Section 2.1.

The work by Wong et al. reveals an essential issue in publishing microdata: a
generalization method should preserve privacy, even against adversaries with
knowledge of the anonymization algorithm. The techniques in [Wong et al. 2007] take
a first step towards addressing this issue by dealing with minimality attacks. This
alone, however, is insufficient for privacy protection. Specifically, given
information about the anonymization method, an adversary can easily devise other
types of attacks to compromise a generalized table. To explain this, in the following
we first clarify how minimality attacks work, and then elaborate on the deficiencies
of [Wong et al. 2007].

Minimality Attacks. Good generalization should keep the QI values as accurate
as possible. Towards this objective, the previous algorithms [Bayardo and Agrawal
2005; Fung et al. 2005; Ghinita et al. 2007; LeFevre et al. 2005; 2006a; Xiao and
Tao 2007] produce minimal generalizations, where no QI-group can be divided into
smaller groups without violating the underlying privacy principle. For example,
Table III is a minimal 2-diverse generalization of Table I under global recoding. In
particular, the first (second) QI-group in Table III cannot be divided, since any
split of the group results in two QI-groups with a single tuple, which apparently
cannot be 2-diverse. On the other hand, as Fred, Gill, and Hera have identical QI
values, their tuples must be in the same QI-group, as demanded by global recoding.
Therefore, the only way to partition the third QI-group is to break it into {Ed}
and {Fred, Gill, Hera}, which also violates 2-diversity because {Ed} is a single tuple.

^2 Note that a minimality attack can be effective only when the microdata is anonymized with
generalization or a similar methodology called anatomy [Xiao and Tao 2006a]. There exist other
anonymization methods that are immune to attacks based on knowledge of the anonymization
algorithm, as will be discussed in Section 5.

Name    Age    Zipcode    Disease
Ann     21     10000      dyspepsia
Bob     27     18000      flu
Gate    32     35000      gastritis
Don     32     35000      gastritis
Ed      54     60000      bronchitis
Fred    60     63000      flu
Gill    60     63000      dyspepsia
Hera    60     63000      diabetes

Table IV. Microdata T3

Age         Zipcode       Disease
[21, 27]    [10k, 18k]    dyspepsia
[21, 27]    [10k, 18k]    flu
[32, 60]    [35k, 63k]    gastritis
[32, 60]    [35k, 63k]    gastritis
[32, 60]    [35k, 63k]    bronchitis
[32, 60]    [35k, 63k]    flu
[32, 60]    [35k, 63k]    dyspepsia
[32, 60]    [35k, 63k]    diabetes

Table V. Generalization T4*

Algorithm Vul-Gen (T)
1. if T is the microdata T1 in Table I
      return the generalization in Table V
2. otherwise, return a generalization of T that is different from T4*

Fig. 1. The Vul-Gen algorithm

Minimal generalizations can lead to severe privacy breaches. Consider that a
hospital holds the microdata in Table IV, and releases the 2-diverse Table V, which
is a minimal generalization under global recoding. Assume that an adversary has
access to the voter registration list in Table II. Then, s/he can easily identify the
six individuals in the second QI-group G2 = {Gate, Don, Ed, Fred, Gill, Hera}
in Table V. After that, the adversary can infer the diseases of Gate and Don by
reasoning as follows (i.e., a minimality attack). First, there exist only two tuples in
G2 with the same disease, which is gastritis. Second, since Table V is minimal, if we
split G2 into two parts G3 = {Gate, Don} and G4 = {Ed, Fred, Gill, Hera}, either
G3 or G4 must violate 2-diversity. Assume that G4 is not 2-diverse. In that case,
at least three tuples in G4 should have an identical sensitive value, contradicting
the fact that, in G2, the maximum number of tuples with the same Disease value is
2. It follows that G3 cannot be 2-diverse, indicating that both Gate and Don have
the same disease, which must be gastritis (as mentioned earlier, no other disease is
possessed by two tuples in G2).
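
The adversary's reasoning can be replayed by brute force. The sketch below is an illustration under the same simplification as the text: it considers only the split of G2 into G3 = {Gate, Don} and G4 = {Ed, Fred, Gill, Hera}, enumerates every way of assigning the diseases of G2 to its six members, and keeps the assignments for which that split would violate 2-diversity. All names in the code are hypothetical.

```python
from collections import Counter
from itertools import permutations

def is_l_diverse(values, l):
    """Frequency-based l-diversity check for one QI-group."""
    return max(Counter(values).values()) * l <= len(values)

people = ["Gate", "Don", "Ed", "Fred", "Gill", "Hera"]
g2_diseases = ["gastritis", "gastritis", "bronchitis",
               "flu", "dyspepsia", "diabetes"]

def split_violates_2_diversity(assignment):
    """True if G3 or G4 fails 2-diversity, as minimality of Table V requires."""
    g3 = [assignment[p] for p in ("Gate", "Don")]
    g4 = [assignment[p] for p in ("Ed", "Fred", "Gill", "Hera")]
    return not (is_l_diverse(g3, 2) and is_l_diverse(g4, 2))

# set() removes duplicates caused by the two identical "gastritis" entries.
assignments = [dict(zip(people, perm)) for perm in set(permutations(g2_diseases))]
survivors = [a for a in assignments if split_violates_2_diversity(a)]
gate_don_values = {(a["Gate"], a["Don"]) for a in survivors}
```

Every surviving assignment gives gastritis to both Gate and Don, which matches the conclusion of the minimality attack.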

Motivation. Wong et al. [2007] go beyond the other solutions by assuming that
an adversary has one extra piece of knowledge: whether the anonymization
algorithm produces a minimal generalization (note: the adversary is not allowed to have
other details of the algorithm). Under this assumption, minimality attacks can be
prevented using a simple solution: just deploy non-minimal generalizations.
Nevertheless, given knowledge of the algorithm, can the adversary employ other types
of attacks to compromise non-minimal generalizations? The answer, unfortunately,
is yes, as the following simple example demonstrates.

Example 1. Consider the conceptual anonymization algorithm Vul-Gen in
Figure 1. The algorithm takes as input a microdata table T, and generates a
generalization T* of T. In particular, Vul-Gen outputs the generalization T4* in Table V, if
and only if T equals the microdata T1 in Table I. Notice that T4* is not a minimal
2-diverse version of T1. This is because the second QI-group of T4*, including the
tuples {Gate, Don, Ed, Fred, Gill, Hera}, can be divided into 2-diverse QI-groups
{Gate, Don} and {Ed, Fred, Gill, Hera}, which conform to global recoding.

Assume that a data publisher applies Vul-Gen on T1, and releases the resulting
2-diverse generalization T4*. Since T4* is not minimal, it does not suffer from
minimality attacks. However, imagine an adversary who knows that Vul-Gen is the
generalization algorithm adopted by the publisher. Once T4* is released, the
adversary immediately concludes that T1 is the microdata, because Vul-Gen outputs T4*
if and only if the input is T1. Hence, the adversary learns the exact disease of every
individual, i.e., releasing T4* causes a severe privacy breach. □

It is clear from the above discussion that preventing minimality attacks alone is
insufficient for privacy preservation, since an adversary (with understanding about
the generalization algorithm) may employ numerous other types of attacks to infer
sensitive information. This leads to a challenging problem: how can we anonymize
the microdata in a way that proactively prevents all privacy attacks that may be
launched based on knowledge of the algorithm?

Zhang et al. [2007] present the first theoretical study on the above problem.
The core of their solution is a privacy model in which the anonymization algorithm
(adopted by the publisher) is assumed to be public knowledge^3. As will be discussed
in Section 2.3, however, Zhang et al.'s privacy model is only applicable to a small
subset of anonymization algorithms that (i) are deterministic, (ii) adopt global
recoding generalization, and (iii) follow a particular algorithmic framework. This
severely restricts the design of new anonymization approaches under the model, and
makes it impossible to verify the privacy guarantees of existing randomized or
local-recoding-based algorithms. Furthermore, the anonymization algorithms proposed
by Zhang et al. have high time complexities: all but one algorithm run in time
exponential in the number n of tuples in the microdata, while the remaining one
has a time complexity that is polynomial in n and the total number m of possible
generalizations of the microdata. Note that, in practice, m can be exponential in
n, since there may exist an exponential number of ways to divide the tuples in the
microdata into QI-groups. As a consequence, the algorithms developed by Zhang
et al. are hardly applicable in practice.

Contributions. This paper develops a practical solution for data publishing
against an adversary who knows the anonymization algorithm. First, we propose a
model for evaluating the degree of privacy protection achieved by an anonymized
table, assuming that the adversary has knowledge of (i) the anonymization
algorithm employed by the publisher, (ii) the algorithmic parameters with which the
anonymized table is computed, and (iii) the QI values of all individuals in the
microdata. Our model captures all deterministic and randomized generalization
algorithms [Aggarwal et al. 2006; Bayardo and Agrawal 2005; Fung et al. 2005;
Ghinita et al. 2007; LeFevre et al. 2005; 2006a; 2006b; Iyengar 2002; Wang et al.
2004; Wong et al. 2006; Xiao and Tao 2007; Xu et al. 2006; Wong et al. 2007;
Zhang et al. 2007], regardless of whether they adopt global recoding or local
recoding. The model is even applicable to anonymized tables produced from anatomy
[Xiao and Tao 2006a], a popular anonymization methodology that will be clarified
in Section 2.1. Based on this model, we develop a new privacy principle called
transparent l-diversity, which safeguards privacy against the adversary we consider.

As a second step, we identify two sufficient conditions for transparent l-diversity,
based on which we propose three anonymization algorithms that achieve
transparent l-diversity. None of these algorithms could have been possible under Zhang
et al.'s privacy model, as they are either randomized or based on local recoding.
We provide detailed analysis on the characteristics of each algorithm, and show 
that they all run in 0{n^ logn) time. In addition, we demonstrate the effective- 
ness and efficiency of our algorithms through extensive experiments with real data. 
Compared with the existing anonymization techniques that do not ensure transpar- 
ent l-diversity, our solutions not only provide stronger privacy protection, but also 
achieve satisfactory performance in terms of data utility and computation overhead. 

The rest of the paper is organized as follows. Section 2 presents the theoretical
framework that underlies transparent l-diversity. Section 3 presents our generalization
algorithms, which are experimentally evaluated in Section 4. Section 5 surveys
the previous work related to ours. Finally, Section 6 concludes the paper with
directions for future research.

^3 This is reminiscent of Kerckhoffs' principle (widely adopted in cryptography): a cryptographic
system should be secure, even if everything about the system, except the key, is public knowledge.

2. PRIVACY MODEL 

This section presents our analytical model for assessing disclosure risks. In
Section 2.1, we formalize several basic concepts. After that, Section 2.2 elaborates
the derivation of disclosure risks. Section 2.3 discusses the differences between our
model and the methods in [Wong et al. 2007] and [Zhang et al. 2007].

2.1 Preliminaries 

Let T be a microdata table to be published. We assume that T contains d + 2
attributes, namely, (i) an identifier attribute $A^{id}$, which is the primary key of T,
(ii) a sensitive attribute $A^{s}$, and (iii) d QI attributes $A^{q}_1, \ldots, A^{q}_d$. As in most existing
work, we require that $A^{s}$ should be categorical, while the other attributes can be
either numerical or categorical.

For each tuple t in T, let t[A] be the value of t on the attribute A. We define a
QI-group as a set of tuples, and a partition of T as a set of disjoint QI-groups of T
whose union equals T. We say that two QI-groups G1 and G2 are isomorphic, if
(i) G1 and G2 contain the same multi-set of sensitive values, and (ii) every tuple
t1 ∈ G1 shares the same identifier and QI values with a tuple t2 ∈ G2, and vice
versa. For instance, let G1 be a QI-group that contains the first two tuples in
Table I. Suppose that we swap the sensitive values of Ann and Bob, such that Ann
(Bob) has flu (dyspepsia). Then, the resulting QI-group G2 is isomorphic to G1.
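
The two isomorphism conditions can be checked mechanically. The following is a sketch only: the tuple layout (identifier, QI values, sensitive value) and the function name are assumptions introduced for this example.

```python
from collections import Counter

def are_isomorphic(g1, g2):
    """g1, g2: lists of (identifier, qi_values, sensitive_value) tuples."""
    # Condition (i): same multi-set of sensitive values.
    same_sensitive = Counter(t[2] for t in g1) == Counter(t[2] for t in g2)
    # Condition (ii): same identifiers with the same QI values, in both directions.
    same_people = {(t[0], t[1]) for t in g1} == {(t[0], t[1]) for t in g2}
    return same_sensitive and same_people

g1 = [("Ann", (21, 10000), "dyspepsia"), ("Bob", (27, 18000), "flu")]
g2 = [("Ann", (21, 10000), "flu"), ("Bob", (27, 18000), "dyspepsia")]  # values swapped
g3 = [("Ann", (21, 10000), "flu"), ("Bob", (27, 18000), "gastritis")]  # different multi-set
```

Here `g1` and `g2` reproduce the Ann/Bob swap from the text, while `g3` fails condition (i).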

We formalize the anonymization of T as follows. 

Definition 1 (Anonymization). An anonymization function f is a
function that maps a QI-group to another set of tuples, such that for any two isomorphic
QI-groups G1 and G2, f(G1) = f(G2) always holds. Given a partition P of T and
an anonymization function f, a table T* is an anonymization of T decided by P
and f, if and only if $T^* = \bigcup_{G \in P} f(G)$.

There exist two popular types of anonymization methodologies, namely,
generalization [Samarati 2001] and anatomy [Xiao and Tao 2006a]. Specifically,
generalization employs an anonymization function that maps a QI-group G to a set G*
of tuples, such that (i) for any tuple t* ∈ G*, $t^*[A^{q}_i]$ (i ∈ [1, d]) is an interval
containing all $A^{q}_i$ values in G, and (ii) any two tuples in G* have the same QI values.
Anatomy, on the other hand, adopts an anonymization function that transforms
a QI-group G to two separate sets of tuples, such that the first (second) set contains
only the QI (sensitive) values in G. For example, given a partition of Table I that
contains three QI-groups {Ann, Bob}, {Gate, Don}, and {Ed, Fred, Gill, Hera},
Table VI illustrates an anonymization of Table I produced from anatomy. Observe
that Table VIa (VIb) contains only the QI (sensitive) values in Table I.
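
A minimal sketch of an anatomy-style anonymization function is given below. It is an illustration rather than the construction of [Xiao and Tao 2006a]: the function name, the data layout, and the choice to sort the sensitive values inside each group (so that row order does not link a disease to a person) are assumptions.

```python
def anatomize(partition):
    """partition: list of QI-groups, each a list of (qi_values, sensitive_value).

    Returns (qi_table, sensitive_table); both carry a 1-based group ID.
    """
    qi_table, sensitive_table = [], []
    for group_id, group in enumerate(partition, start=1):
        for qi_values, _ in group:
            qi_table.append((*qi_values, group_id))
        # Sorting hides which tuple in the group each sensitive value came from.
        for value in sorted(value for _, value in group):
            sensitive_table.append((group_id, value))
    return qi_table, sensitive_table

# The three QI-groups of Table I: {Ann, Bob}, {Gate, Don}, {Ed, Fred, Gill, Hera}.
partition = [
    [((21, 10000), "dyspepsia"), ((27, 18000), "flu")],
    [((32, 35000), "gastritis"), ((32, 35000), "bronchitis")],
    [((54, 60000), "gastritis"), ((60, 63000), "flu"),
     ((60, 63000), "dyspepsia"), ((60, 63000), "diabetes")],
]
qi_table, sensitive_table = anatomize(partition)
```

With this input, the two returned lists correspond to the QI table and the sensitive table of Table VI.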

Age    Zipcode    Group ID        Group ID    Disease
21     10000      1               1           dyspepsia
27     18000      1               1           flu
32     35000      2               2           bronchitis
32     35000      2               2           gastritis
54     60000      3               3           diabetes
60     63000      3               3           dyspepsia
60     63000      3               3           flu
60     63000      3               3           gastritis

(a) The QI table                  (b) The sensitive table

Table VI. An anonymization of Table I produced from anatomy

The techniques developed in this paper can be incorporated with any
anonymization method that conforms to Definition 1. For ease of exposition, in the rest of the
paper we will adopt a specific anonymization function, namely, the MBR (Minimum
Bounding Rectangle) generalization function [Bayardo and Agrawal 2005; Ghinita
et al. 2007; LeFevre et al. 2006a; Xiao and Tao 2007]. This function anonymizes a
QI-group G by replacing each $A^{q}_i$ (i ∈ [1, d]) value with the tightest interval that
contains all $A^{q}_i$ values in G. For instance, Table III is obtained by applying the MBR
function to a partition of Table I with three QI-groups {Ann, Bob}, {Gate, Don},
and {Ed, Fred, Gill, Hera}.
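
The MBR function is simple enough to sketch directly. The code below assumes numerical QI attributes and a hypothetical data layout of (QI values, sensitive value) pairs; degenerate intervals such as (32, 32) correspond to the single values shown in Table III.

```python
def mbr_generalize(group):
    """Replace every QI value with the tightest interval covering the group.

    group: list of (qi_values, sensitive_value) with numerical QI values.
    Returns a list of (intervals, sensitive_value), one entry per input tuple.
    """
    dimensions = range(len(group[0][0]))
    box = tuple(
        (min(qi[i] for qi, _ in group), max(qi[i] for qi, _ in group))
        for i in dimensions
    )
    return [(box, value) for _, value in group]

# The partition of Table I that yields Table III.
partition = [
    [((21, 10000), "dyspepsia"), ((27, 18000), "flu")],
    [((32, 35000), "gastritis"), ((32, 35000), "bronchitis")],
    [((54, 60000), "gastritis"), ((60, 63000), "flu"),
     ((60, 63000), "dyspepsia"), ((60, 63000), "diabetes")],
]
generalized = [row for group in partition for row in mbr_generalize(group)]
```

All tuples of a QI-group receive the same box, so the function satisfies condition (ii) of generalization by construction.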

Let T* be the anonymization of T released by the publisher. T* should satisfy
l-diversity:

Definition 2 (l-Diversity [Machanavajjhala et al. 2007]). A QI-group
G is l-diverse, if and only if it contains at most |G|/l tuples with the same sensitive
value. A partition is l-diverse, if and only if each of its QI-groups is l-diverse. An
anonymization is l-diverse, if and only if it is produced from an l-diverse partition.

It is noteworthy that there exist several different definitions of l-diversity
[Machanavajjhala et al. 2007]. For example, entropy l-diversity requires that the
entropy of sensitive values in each QI-group should be at least ln l; recursive
(c, l)-diversity demands that, even if we remove l − 2 arbitrary sensitive values in a
QI-group G, at most c fraction of the remaining tuples should have the same sensitive
value. Definition 2 corresponds to a simplified version of recursive (c, l)-diversity,
and has been widely adopted previously [Ghinita et al. 2007; Xiao and Tao 2006a;
Wong et al. 2006; Wong et al. 2007].
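
For comparison with Definition 2, the entropy variant mentioned above can be sketched as follows. The function name and the small floating-point tolerance are assumptions of this illustration; the entropy uses the natural logarithm so that the threshold is ln l.

```python
import math
from collections import Counter

def is_entropy_l_diverse(sensitive_values, l, tolerance=1e-12):
    """True if the entropy of the sensitive values is at least ln(l)."""
    n = len(sensitive_values)
    entropy = -sum((count / n) * math.log(count / n)
                   for count in Counter(sensitive_values).values())
    return entropy >= math.log(l) - tolerance

balanced = ["flu", "dyspepsia"]
skewed = ["gastritis", "gastritis", "flu"]
mixed = ["a", "a", "b", "c"]  # hypothetical sensitive values
```

A group such as `mixed` satisfies the entropy condition for l = 2 but not for l = 3, which illustrates that the entropy and frequency formulations are distinct requirements.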

Let Q be the anonymization algorithm adopted by the publisher. Q can be either
deterministic or randomized, but it should be an l-diversity algorithm. That is,
Q should take as input any microdata T' and any positive integer l, and output
either ∅ or an l-diverse anonymization of T'. In particular, Q may return ∅ when
no l-diverse anonymization exists for T'. For instance, given the microdata T1 in
Table I, no algorithm can generate a 10-diverse anonymization, since T1 contains
only 8 tuples.

Consider an adversary who tries to infer sensitive information from T*. As
demonstrated in Section 1, the adversary may employ an external source (e.g.,
a voter registration list) to identify the individuals involved in T*. More formally,
we define an external source E as a table that contains all attributes in T, except
$A^{s}$. In addition, for each tuple t ∈ T, there should exist a unique record e ∈ E,
such that t and e coincide on all identifier and QI attributes. In other words, each
individual in T should appear in E, but not necessarily vice versa. For example,
the external source E1 in Table II involves all individuals in the microdata T1 in
Table I, but it also contains the information of Bruce, who does not appear in T1.

In addition to E and T*, we also assume that the adversary knows the details
of the anonymization algorithm Q and the value of l used by the publisher (in
practice, l can be inferred from T* [Wong et al. 2007]). We quantify the disclosure
risks incurred by the publication of T* as:

Definition 3 (Disclosure Risk). For any individual o, the disclosure risk
risk(o) of o in T* is the tight upper bound of the adversary's posterior belief in the
event that "o appears in T and has a sensitive value v", given T*, any sensitive
value v, the external source E, the algorithm Q, and the value of l:

$$risk(o) = \max_{v} \Pr\{\, o \text{ appears in } T \text{ and has a sensitive value } v \mid T^* \wedge E \wedge Q \wedge l \,\}, \qquad (1)$$

where Pr{X | Y} denotes the conditional probability of event X given the occurrence
of event Y.

2.2 Disclosure Risks in Anonymized Tables 

Next, we present a detailed analysis of disclosure risks. Before examining T*, the
adversary has no information about (i) which individuals in the external source E
appear in T, and (ii) what the sensitive value of each person is. Thus, from the
adversary's perspective, there exist many possible instances of the microdata. In
particular, each instance $\hat{T}$ may involve any individuals in E, and each person in
$\hat{T}$ can have an arbitrary sensitive value. We formally define such instances as:

Definition 4 (Possible Microdata Instance). Given an external source
E, a possible microdata instance based on E is a microdata table $\hat{T}$ that
contains a subset of the individuals in E, such that each of these individuals has the
same QI values in E and $\hat{T}$ (the sensitive value of each individual in $\hat{T}$ can be
arbitrary).
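
To make the definition concrete, the sketch below enumerates every possible microdata instance for a small external source. It is an illustration with assumed names; the external source is a three-person excerpt of Table II and the sensitive domain is restricted to three diseases to keep the enumeration small. With n individuals and a domain of size k, the enumeration yields (1 + k)^n instances, including the empty one.

```python
from itertools import combinations, product

def possible_instances(external, domain):
    """Yield every table that picks a subset of `external` and assigns each
    chosen individual an arbitrary sensitive value from `domain`."""
    for size in range(len(external) + 1):
        for subset in combinations(external, size):
            for values in product(domain, repeat=size):
                yield [(*person, value) for person, value in zip(subset, values)]

external = [("Bruce", 29, 19000), ("Gate", 32, 35000), ("Fred", 60, 63000)]
domain = ["flu", "dyspepsia", "bronchitis"]
instances = list(possible_instances(external, domain))

# One particular instance over these three individuals.
sample_instance = [("Bruce", 29, 19000, "bronchitis"),
                   ("Gate", 32, 35000, "flu"),
                   ("Fred", 60, 63000, "dyspepsia")]
```

The exponential growth of the instance set is why a literal enumeration is only practical for toy inputs.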

For example, given the external source in Table II, Table VII is a possible
microdata instance. Note that the microdata T itself is also a possible instance. In
general, possible instances may be completely different from T, e.g., Table I and
Table VII do not even have the same cardinality. Nevertheless, it is reasonable to
assume that, before inspecting T*, the adversary considers each possible instance to
be equally likely. This assumption is referred to as the random worlds assumption
[Bacchus et al. 1996], and is adopted by most existing work on data anonymization^4
[Byun et al. 2006; Chen et al. 2007; Kifer and Gehrke 2006; LeFevre et al. 2006b;
Li et al. 2007; Martin et al. 2007; Nergiz et al. 2007; Wang and Fung 2006; Wong
et al. 2006; Wong et al. 2007; Xiao and Tao 2006b; 2007; Zhang et al. 2007; Zhang
et al. 2007].

^4 Recent research [Kifer 2009] shows that, when the random worlds assumption does not hold,
some of the existing anonymization methods are vulnerable to privacy attacks based on machine
learning techniques. The treatment of such privacy attacks is beyond the scope of this paper.

Name     Age    Zipcode    Disease
Bruce    29     19000      bronchitis
Gate     32     35000      flu
Fred     60     63000      dyspepsia

Table VII. A Possible Microdata Instance Based on Table II

Algorithm Opt-Gen (T, l)
1. S_P = a set containing all partitions P of T, such that P and the MBR function
   decide an l-diverse global recoding generalization
2. if S_P = ∅ then return ∅
3. among all P ∈ S_P, select the one that minimizes Σ_{G ∈ P} |G|^2
4. return the generalization determined by P and the MBR function

Fig. 2. The Opt-Gen algorithm

Let S be the set of all possible microdata instances based on E. Now, consider
that the adversary has obtained T*, the anonymization algorithm Q, and the
parameter l. For simplicity, assume for the moment that Q is deterministic. The
adversary can utilize the algorithm Q to refine S. Specifically, s/he can apply Q on
each instance $\hat{T} \in S$, and inspect the output of Q. If $\hat{T}$ leads to an anonymization
different from T*, the adversary asserts that $\hat{T}$ is not the real microdata T. Let S'
be the set of instances that pass the sanity check, i.e., for each $\hat{T} \in S'$, $Q(\hat{T}, l) = T^*$
(apparently, T ∈ S').

The adversary then uses S' to infer the sensitive information in T. As a special
case, if an individual o is associated with an $A^{s}$ value v in all instances in S', then
v must be the $A^{s}$ value of o in T. In general, the probability that o has v in T
depends on the portion of instances in S' where o has v. We refer to the above
inference approach as a reverse engineering attack.

Example 2. Consider the l-diversity generalization algorithm Opt-Gen, as
shown in Figure 2. In a nutshell, Opt-Gen employs the MBR function, and
returns l-diverse generalizations that (i) obey global recoding, and (ii) minimize the
discernability metric [Bayardo and Agrawal 2005]. Specifically, the discernability
of a generalized table T* equals Σ_{G ∈ P} |G|^2, where P is the partition that decides
T*.
Suppose that a publisher adopts Opt-Gen to anonymize the microdata T1 in
Table I, setting l to 2. Table III illustrates the resulting generalization T2*. Assume
that an adversary has the external source E1 in Table II, and knows Opt-Gen and
l = 2. To launch a reverse engineering attack, s/he first constructs the set S of
all possible microdata instances based on E1 (e.g., Table VII is one instance in
S). As a second step, the adversary invokes Opt-Gen on each $\hat{T} \in S$, and verifies
whether the output of Opt-Gen is T2*. Let S' be the maximal subset of S such that
Opt-Gen($\hat{T}$, 2) = T2* for each $\hat{T} \in S'$. In the sequel, we will show that every $\hat{T} \in S'$
must associate Ed with gastritis. Namely, based on T2*, E1, l = 2, and the details
of Opt-Gen, the adversary can infer the exact disease of Ed.

Let G1, G2, and G3 be the first, second, and third QI-group in T2*, respectively.
Any $\hat{T} \in S$ that can be generalized to T2* must satisfy the following conditions.
First, $\hat{T}$ should not involve Bruce, since his age 29 is not covered by any Age interval
in T2*. Second, $\hat{T}$ should either (i) associate Ann with dyspepsia and Bob with flu,
or (ii) conversely, associate Ann and Bob with flu and dyspepsia, respectively. This
is because Ann and Bob are the only individuals whose ages fall in the Age interval
[21, 27] of G1, while G1 contains two sensitive values dyspepsia and flu. By the same
reasoning, $\hat{T}$ should assign the diseases in G2 (G3) to Gate and Don (Ed, Fred, Gill,
and Hera).

We are now ready to prove that any possible microdata instance in S' must set
the sensitive value of Ed to gastritis. Assume, on the contrary, that this is not
true in some T' ∈ S'. Then, since Ed is in G3, his disease in T' must be one of {flu,
diabetes, dyspepsia}, i.e., the sensitive values in G3 except gastritis. In that case,
Ed's disease in T' must differ from those of Gate and Don (each of whom suffers
from either gastritis or bronchitis in T'). Hence, we can construct a 2-diverse
QI-group G2' = {Gate, Don, Ed}. The other tuples in T' can also form two 2-diverse
QI-groups G1' = {Ann, Bob} and G3' = {Fred, Gill, Hera}.

Let P' = {G1', G2', G3'}, which decides a 2-diverse global recoding generalization.
Let us refer to that generalization as T'*. The discernability of T'* is 2^2 + 3^2 + 3^2 =
22, which is smaller than the discernability 24 of T2*. As Opt-Gen minimizes the
discernability, given T' as the input, it should have output T'* instead of T2*, leading
to a contradiction. In conclusion, Ed must be assigned a sensitive value gastritis in
any $\hat{T} \in S'$. □
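
The two discernability values used in the argument can be verified with a short computation. The helper name below is an assumption; the metric itself is the sum of squared QI-group sizes given in Example 2.

```python
def discernability(partition):
    """Sum of squared QI-group sizes for a partition (list of groups)."""
    return sum(len(group) ** 2 for group in partition)

# Partition behind T2* (Table III).
p_t2_star = [["Ann", "Bob"], ["Gate", "Don"], ["Ed", "Fred", "Gill", "Hera"]]
# Partition P' constructed in the proof by contradiction.
p_prime = [["Ann", "Bob"], ["Gate", "Don", "Ed"], ["Fred", "Gill", "Hera"]]

cost_t2_star = discernability(p_t2_star)
cost_p_prime = discernability(p_prime)
```

Since the metric depends only on group sizes, moving Ed from the four-tuple group into the two-tuple group lowers the cost from 24 to 22.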

The above discussion motivates the following proposition for computing disclo- 
sure risks. 

Proposition 1. Let o be any individual, E be an external source, and T* be
an anonymization of T produced with an l-diversity algorithm Q and a parameter l.
Let S be the set of possible microdata instances based on E. Let $S_{o,v}$ be the maximal
subset of S, such that each instance $\hat{T} \in S_{o,v}$ includes a tuple t, with $t[A^{id}] = o$ and
$t[A^{s}] = v$. Then,

$$risk(o) = \max_{v} \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{Q(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{Q(\hat{T}, l) = T^*\}}, \qquad (2)$$

where $\Pr\{Q(\hat{T}, l) = T^*\}$ denotes the probability that, given $\hat{T}$ and l, algorithm Q
outputs T*.
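
Equation (2) can be evaluated by brute force on a toy input. The sketch below is an illustration under loud assumptions: instances are dictionaries from names to sensitive values, the adversary is assumed to know that both individuals appear in the microdata, and the two publishers are hypothetical algorithms invented for this example (neither is Opt-Gen nor any algorithm from the paper).

```python
from fractions import Fraction
from itertools import product

def disclosure_risk(person, instances, output_prob, published):
    """Evaluate Equation (2): output_prob(inst, published) = Pr{Q(inst) = published}."""
    total = sum(output_prob(inst, published) for inst in instances)
    if total == 0:
        raise ValueError("no instance can produce the published table")
    weight_by_value = {}
    for inst in instances:
        if person in inst:
            value = inst[person]
            weight_by_value[value] = (weight_by_value.get(value, 0)
                                      + output_prob(inst, published))
    return max(weight_by_value.values(), default=0) / total

def deterministic(algorithm):
    """Wrap a deterministic algorithm as a 0/1 output probability."""
    return lambda inst, published: Fraction(int(algorithm(inst) == published))

def sorted_publisher(inst):
    """Hypothetical: publish the sorted sensitive values if they are all distinct."""
    values = sorted(inst.values())
    return tuple(values) if len(set(values)) == len(values) else None

def ordered_publisher(inst):
    """Hypothetical leaky variant: values are published in name order."""
    values = [inst[name] for name in sorted(inst)]
    return tuple(values) if len(set(values)) == len(values) else None

names, domain = ["ann", "bob"], ["flu", "gastritis"]
instances = [dict(zip(names, values)) for values in product(domain, repeat=2)]
published = ("flu", "gastritis")

safe_risk = disclosure_risk("ann", instances, deterministic(sorted_publisher), published)
leaky_risk = disclosure_risk("ann", instances, deterministic(ordered_publisher), published)
```

Both publishers release a 2-diverse group, yet only the first keeps the risk at 1/2: the second lets the output depend on who holds which value, so a single instance is consistent with the published table. A randomized algorithm can be plugged in by passing an `output_prob` function that returns fractional probabilities.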

The proofs of all propositions, lemmas, and theorems can be found in the ap- 
pendix. We are now ready to introduce the transparent l-diversity principle, for 
protecting privacy when the anonymization algorithm is "transparent" to adver- 
saries. 

Definition 5 (Transparent l-Diversity). An anonymization T* of T is
transparently l-diverse if, given any external source, T* ensures risk(o) ≤ 1/l
for any individual o. An l-diversity algorithm Q is transparent, if and only if
given any microdata T and any positive integer l, algorithm Q outputs either ∅ or
a transparently l-diverse anonymization of T.

Intuitively, an l-diversity algorithm Q is transparent, if and only if each output
T* of Q can be generated from a set S of possible microdata instances, such that 
each individual o is associated with a diverse set of sensitive values in different 
instances. As the adversary cannot decide which instance in S corresponds to the 
input microdata, s/he would not be able to infer the exact sensitive value of o from 
T*. The fact that each instance in S can lead to T* implies that the output of Q
should not be highly dependent on the sensitive value of any particular individual.
For instance, the Opt-Gen algorithm fails in Example 2, because it outputs T2*
(in Table III) only if Ed has the sensitive value gastritis. In general, a transparent
algorithm should anonymize data in a manner such that none of the steps of the 
anonymization process is uniquely decided by the sensitive value of a particular 
tuple. In Section 3, we will present three transparent algorithms that are developed 
according to the above principle. 

2.3 Comparison with Previous Work 

As explained in Section 1, [Wong et al. 2007] and [Zhang et al. 2007] are the only 
previous works that do not assume adversaries with no knowledge of the anonymiza- 
tion algorithm Q. In this section, we elaborate the solutions in [Wong et al. 2007] 
and [Zhang et al. 2007] , and point out how they differ from our solution. 

Comparison with [Wong et al. 2007]. The privacy model in [Wong et al. 
2007] assumes that (i) the anonymization algorithm Q is deterministic, and (ii) the 
adversary knows whether Q produces minimal generalization. To clarify the model, 
we begin by reviewing several concepts in [Wong et al. 2007] . 

Definition 6 (Child Partition). Let P1 and P2 be two partitions of T. P2
is a child of P1, if and only if there exist G1 ∈ P1 and G2, G3 ∈ P2, such that (i)
G1 = G2 ∪ G3, and (ii) P1 − {G1} = P2 − {G2, G3}.

Note that we can obtain a child of a partition P by splitting a QI-group in P
into two smaller QI-groups.

Definition 7 (Minimal Generalization). Let f be a generalization function,
P an l-diverse partition, and T* the generalization decided by f and P. T* is
a minimal l-diverse generalization under global (local) recoding, if f and any
child of P cannot decide an l-diverse generalization under the same recoding.

For example, Table III is a minimal 2-diverse generalization of Table I with
respect to the MBR function and global recoding, as explained in Section 1. Given
a generalization function f and recoding scheme H, we say that an l-diversity
algorithm is minimal, if it produces only minimal generalizations under f and H.
The subsequent discussion will focus on minimal algorithms Q, because the results
of [Wong et al. 2007] are inapplicable to non-minimal algorithms (i.e., minimality
attacks cannot be performed if Q is non-minimal).

In a similar fashion to Definition 1, Wong et al. [2007] formulate the disclosure 
risks (referred to as credibilities in [Wong et al. 2007]) as: 

Definition 8 (Credibility). Let o be any individual, and V be a predefined
subset of the values in A^s. The credibility of o in T* is the adversary's maximum
posterior belief in the event that "o appears in T and has a sensitive value v", given
T*, an external source E, generalization function f, recoding scheme H, value of
l, and Q being minimal:

\[
cred(o) = \max_{v \in V} \Pr\{\, o \text{ has } v \text{ in } T \mid T^* \wedge E \wedge f \wedge H \wedge l \wedge Q \text{ is minimal} \,\}.
\]

Note that the credibility model quantifies disclosure risks based only on a subset
V of the A^s values. To facilitate the comparison between the credibility model and
our privacy model, we assume V = A^s in the rest of the paper.
Credibilities can be derived as:

Proposition 2 [Wong et al. 2007]. Let o, E, f, H, l be as introduced in
Definition 8. Let S+ be the set including any possible microdata instance T based
on E, such that T* is a minimal l-diverse generalization of T with respect to f and
H. Let S+_v be the maximal subset of S+, such that in each instance in S+_v, o is
associated with a sensitive value v. We have

\[
cred(o) = \max_{v \in A^s} |S^+_v| \,/\, |S^+|. \tag{3}
\]

The following analysis will confirm the intuition that credibilities underestimate 
the actual privacy risks, when an adversary knows everything about Q. Towards 
this, let us revisit the scenario in Example 2, where the adversary can precisely 
find out Ed's disease with a reverse engineering attack, i.e, the disclosure risk of Ed 
equals the maximum value 1. In the sequel, we will show that cred(Ed) = 1/4. 

Lemma 1. The Opt-Gen algorithm (in Figure 2) is a minimal algorithm. 

Example 3. Consider the settings in Example 2, where T = T1, T* = T2*, E = E1,
Q = Opt-Gen, l = 2, o = Ed. Since Opt-Gen is a minimal algorithm (see Lemma 1),
by Proposition 2, the credibility of Ed in T2* is calculated as max_{v ∈ A^s} |S+_v| / |S+|,
where S+ is the set of all possible microdata instances that have T2* as a minimal
generalization, and S+_v is the subset of instances in S+ that associate Ed with a
certain sensitive value v.

Let T be any possible microdata instance based on E1. As demonstrated in
Example 2, if T can be generalized to T2*, then T must not involve Bruce.
Furthermore, T should assign the sensitive values in the first, second, and third
QI-groups in T2* to {Ann, Bob}, {Gate, Don}, and {Ed, Fred, Gill, Hera}, respectively.
In total, there are 2! × 2! × 4! = 96 different combinations between the sensitive values
and individuals. This leads to a set S_m of 96 possible microdata instances. For
any v = gastritis, flu, dyspepsia, or diabetes, there exist 24 instances in S_m that
associate Ed with v. Since S_m includes all possible microdata instances that can
be generalized to T2*, we have S+ ⊆ S_m.

Next, we will prove S+ = S_m. For this purpose, it suffices to establish that,
for any instance T ∈ S_m, T2* is a minimal 2-diverse generalization with respect to
global recoding and the MBR function f. Let G1 = {Ann, Bob}, G2 = {Gate, Don},
G3 = {Ed, Fred, Gill, Hera}. The partition underlying T2* is P1 = {G1, G2, G3}.
Assume, on the contrary, that T2* is not minimal for some T ∈ S_m. Then, there
exists a partition P2 of T such that (i) P2 is a child of P1, and (ii) P2 and f decide
a 2-diverse global recoding generalization.

As P2 is a child of P1, by Definition 6, we can obtain P2 from P1 by splitting
G1, G2, or G3. However, it is impossible to split G1 (G2) into 2-diverse QI-groups,
since it contains only two tuples. On the other hand, G3 cannot be divided either.
This is because Fred, Gill, and Hera have identical QI values, and thus have to be
in the same QI-group (due to global recoding); meanwhile, the remaining tuple Ed
by itself does not make a 2-diverse QI-group. Hence, under global recoding, no child
of P1 can lead to a 2-diverse generalization of T. It follows that T2* is a minimal
generalization of every T ∈ S_m, i.e., S+ = S_m.
Finally, since (as mentioned earlier) there exist exactly 24 instances in S_m that
assign the same sensitive value to Ed, cred(Ed) = max_{v ∈ A^s} |S+_v| / |S+| = 24/96 =
1/4. □
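The counting argument of Example 3 can be checked by enumeration. The following sketch is only an illustration; the variable names and the encoding of each QI-group as a pair (individuals, sensitive values) are assumptions, with the group contents taken from the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations, product

# QI-groups of T2* as (individuals, sensitive values published for the group).
groups = [
    (("Ann", "Bob"), ("flu", "dyspepsia")),
    (("Gate", "Don"), ("gastritis", "bronchitis")),
    (("Ed", "Fred", "Gill", "Hera"), ("gastritis", "flu", "dyspepsia", "diabetes")),
]

def possible_instances(groups):
    """Enumerate every assignment of each group's values to its individuals."""
    per_group = [
        [dict(zip(people, perm)) for perm in permutations(values)]
        for people, values in groups
    ]
    for combo in product(*per_group):
        instance = {}
        for part in combo:
            instance.update(part)
        yield instance

S_m = list(possible_instances(groups))
ed_counts = Counter(T["Ed"] for T in S_m)   # instances per sensitive value of Ed
cred_ed = max(Fraction(n, len(S_m)) for n in ed_counts.values())
```

Here `len(S_m)` is 96, each of Ed's four candidate values occurs in 24 instances, and `cred_ed` equals 1/4.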

Since the credibility model cannot secure privacy against an adversary who knows 
the anonymization algorithm, any method developed based on the model is
susceptible to reverse engineering attacks. To demonstrate this, we exemplify in the
electronic appendix an attack against Mask, an anonymization approach devised in 
[Wong et al. 2007] under the credibility model. 

Comparison with [Zhang et al. 2007]. Zhang et al. [2007] consider the pub- 
lication of microdata using deterministic algorithms that adopt global recoding. 
They model a global recoding generalization as a projection of the microdata into 
a "coarsened" multi-dimensional domain. For example, given the microdata Ti in 
Table I, we can coarsen the Age domain, so that it contains only seven values: 
"< 20", "[21,27]", "(27,32)", "32", "(32,54)", "[54,60]", and "> 60". Similarly, 
we can define a coarsened Zipcode domain that has only seven values: "< 10k" , 
"[10k, 18k]", "(18k, 35k)", "35k", "(35k, 60k)", "[60k, 63k]", and "> 63k". Accord- 
ingly, the global recoding generalization T2 in Table III can be regarded as the 
projection of Ti into the three-dimensional domain spanned by Disease and the 
coarsened Age and Zipcode. Let C be the set of all coarsened multi-dimensional 
domains that can be constructed from the attributes in the microdata. Zhang et 
al. assume that the domains in C can be totally ordered by their information loss, 
which measures the degree of coarseness of the domains. For example, the informa- 
tion loss of a domain is (i) minimized if no coarsening is applied, and (ii) maximized 
if every attribute is maximally coarsened. 

Zhang et al. consider that the publisher adopts a deterministic generalization al- 
gorithm Q as follows. Given a microdata T and a privacy principle, Q first examines 
the multi-dimensional domains in C in ascending order of their information loss. 
For each domain D*, Q projects T into D* , and checks whether the resulting gener- 
alization satisfies the given privacy principle. If the principle is satisfied, Q returns 
the generalization and terminates; otherwise, Q moves on to the next domain in
C. In other words, Q always outputs the first generalization that conforms to the
adopted privacy principle. Alternatively, Q may also traverse C in descending order
of information loss, and returns the last generalization on which the given principle 
is satisfied. The adversary is assumed to (i) have an external source E that contains 
only the individuals in the microdata, and (ii) know the privacy principle as well 
as the order in which Q traverses C. 
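The ascending traversal described above amounts to a first-fit search. The sketch below is a simplified illustration rather than Zhang et al.'s actual procedure; `first_satisfying`, `project`, and `satisfies` are hypothetical names, and the toy "domains" are merely bucket widths for a single Age attribute.

```python
from collections import Counter

def first_satisfying(table, domains, info_loss, project, satisfies):
    """Return the first projection, in ascending information loss,
    that satisfies the privacy principle; None if no domain qualifies."""
    for domain in sorted(domains, key=info_loss):
        candidate = project(table, domain)
        if satisfies(candidate):
            return candidate
    return None

# Toy setting: coarsen ages into buckets of a given width, and require
# every published value to be shared by at least two tuples.
ages = [21, 27, 32, 32]
widths = [100, 1, 10]                       # candidate coarsened domains
project = lambda table, w: tuple(a // w for a in table)
at_least_two = lambda gen: min(Counter(gen).values()) >= 2
result = first_satisfying(ages, widths, lambda w: w, project, at_least_two)
```

Because the order of traversal is fixed, an adversary who sees `result` also learns that every domain with lower information loss failed the check, which is exactly the kind of inference the model of Zhang et al. accounts for.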

Under the above problem setting, Zhang et al. present a theoretical study on how 
Q should be designed to prevent the adversary from inferring private information. 
Let n_p be the total number of possible microdata instances based on E. Zhang et al.
first prove that it is NP-hard (with respect to n_p) to compute a generalization that
both ensures privacy and incurs the minimum information loss. After that, they
investigate three special cases of the problem by imposing various constraints on C
and the privacy principle. For each case, they show that the optimal generalization
can be computed in time polynomial in n_p and the size of C. Finally, they propose
a generalization algorithm that ensures entropy l-diversity (see Section 2.1), and
prove that its time complexity is polynomial in |C| and independent of n_p. Note
that, in practice, both |C| and n_p are usually exponential in the number n of tuples
in the microdata.

Compared with the solution in [Wong et al. 2007], Zhang et al.'s techniques 
achieve a higher level of privacy protection, as they can guard against an adversary 
who has full knowledge of the anonymization algorithm. Nevertheless, Zhang et 
al.'s work has the following limitations. First, the privacy model in [Zhang et al. 
2007] is restricted to a particular type of deterministic algorithms that adopt global 
recoding. Consequently, the model cannot be used to evaluate the privacy guaran- 
tee of any existing anonymization algorithm that is randomized or local-recoding- 
based, nor does it support the development of new anonymization approaches of 
those kinds. Second, all algorithms proposed in [Zhang et al. 2007] have time com- 
plexities exponential in the number n of tuples in the microdata, and there is no 
experimental evaluation included in [Zhang et al. 2007] to demonstrate the
effectiveness or efficiency of the algorithms. This leaves open the question of whether
or not the algorithms in [Zhang et al. 2007] are applicable in practice. 

Our work remedies the deficiencies of [Zhang et al. 2007]. In particular, our 
privacy model captures all (deterministic or randomized) anonymization algorithms 
that adopt generalization or anatomy. This general model enables us to design three 
transparent anonymization algorithms, all of which fall beyond Zhang et al.'s model 
as they rely on random choices and/or local recoding. In addition, as will be shown 
in Section 3, our algorithms run in O(n^2 log n) time, which significantly improves
over the exponential time complexities of Zhang et al.'s techniques. Finally, we 
will present in Section 4 an extensive experimental study that demonstrates the 
practical performance of our algorithms in terms of data utility and computation 
time. 



3. ACHIEVING TRANSPARENT L-DIVERSITY 

Equipped with the analytical model in Section 2, our next step is to develop
transparent anonymization algorithms for l-diversity. Ideally, an algorithm should
produce anonymizations with minimum information loss, according to a certain penalty
metric h. Specifically, h is a function that, given a QI-group G, calculates a penalty
h(G) based on the tuples in G. Given h, the information loss of an anonymization
T* is computed as Σ_{G ∈ P} h(G), where P is the partition underlying T*. For
example, the discernability metric deployed in Example 2 corresponds to a function h_d
such that h_d(G) = |G|^2 for any QI-group G.
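As a small illustration of how a penalty metric induces an information loss, the sketch below evaluates the discernability metric on the two partitions compared in Example 2. The function names are assumptions introduced for the example.

```python
def information_loss(partition, h):
    """Sum the penalty h(G) over all QI-groups G of a partition."""
    return sum(h(group) for group in partition)

def h_d(group):
    """Discernability penalty: the squared size of the QI-group."""
    return len(group) ** 2

# Partition underlying T2* and the alternative partition P' from Example 2.
P_T2 = [{"Ann", "Bob"}, {"Gate", "Don"}, {"Ed", "Fred", "Gill", "Hera"}]
P_alt = [{"Ann", "Bob"}, {"Gate", "Don", "Ed"}, {"Fred", "Gill", "Hera"}]

loss_T2 = information_loss(P_T2, h_d)    # 4 + 4 + 16
loss_alt = information_loss(P_alt, h_d)  # 4 + 9 + 9
```

The values 24 and 22 match the discernabilities quoted in Example 2.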

In the following, we will elaborate three transparent algorithms, each of which
can be combined with any penalty metric h, as long as the metric (i) does not
rely on the sensitive values in the input QI-group, and (ii) is superadditive, i.e.,
h(G1 ∪ G2) ≥ h(G1) + h(G2) holds for any disjoint QI-groups G1 and G2. For our
discussion, we use the perimeter function h_p [Ghinita et al. 2007; Iyengar 2002] as
a representative:

\[
h_p(G) = |G| \cdot \sum_{i=1}^{d}
\frac{\max_{t \in G}\{t[A^q_i]\} - \min_{t \in G}\{t[A^q_i]\}}{|A^q_i|}.
\tag{4}
\]

Given a set S_g of QI-groups, we refer to Σ_{G ∈ S_g} h_p(G) as the perimeter of S_g.
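A possible rendering of the perimeter function is sketched below. It assumes that each QI attribute's extent is normalized by a per-attribute constant, supplied here as `domain_extents`; that normalization, the function names, and the tuple encoding are assumptions of this example, since the extracted formula does not show the denominator clearly.

```python
def perimeter(group, domain_extents):
    """Perimeter penalty of one QI-group.

    group          -- non-empty list of QI-value tuples, one per individual
    domain_extents -- assumed normalizer for each QI attribute (e.g., the
                      length of the attribute's domain)
    """
    spread = 0.0
    for i, extent in enumerate(domain_extents):
        values = [t[i] for t in group]
        spread += (max(values) - min(values)) / extent
    return len(group) * spread

def partition_perimeter(groups, domain_extents):
    """Total perimeter of a set of QI-groups."""
    return sum(perimeter(g, domain_extents) for g in groups)

# Two tuples spanning half of each attribute's (assumed) domain.
example = perimeter([(0, 0), (2, 4)], domain_extents=(4, 8))
```

Note that the function reads only QI values, so it satisfies requirement (i) on penalty metrics stated above.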
3.1 The Tailor Algorithm

3.1.1 Algorithm Description. This section presents a transparent algorithm,
Tailor, which produces anonymized tables in a manner similar to the construction
of kd-trees [Friedman et al. 1977]. Tailor requires the microdata T to be
l-eligible. That is, at most |T|/l tuples in T have the same sensitive value. If
T is not l-eligible, Tailor returns ∅, since no l-diverse anonymization of T exists
[Machanavajjhala et al. 2007].
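The l-eligibility test is a simple frequency check. A minimal sketch follows, with the function name chosen for this example and the sensitive values taken from the microdata T5 in Table VIII:

```python
from collections import Counter

def is_l_eligible(sensitive_values, l):
    """True if at most |T|/l tuples share any one sensitive value."""
    if not sensitive_values:
        return True
    most_common = max(Counter(sensitive_values).values())
    return most_common * l <= len(sensitive_values)

# Disease column of T5 (Table VIII), in the order Ann ... Hera.
t5_diseases = ["dyspepsia", "flu", "gastritis", "gastritis",
               "flu", "bronchitis", "dyspepsia", "diabetes"]
```

The comparison is written as a multiplication so that no fractional arithmetic is needed.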

Given an l-eligible T, Tailor first creates a partition P with only one QI-group
G0, which includes all tuples in T. As a second step, Tailor tries to split G0 into
two l-diverse subsets G1 and G2 subject to certain constraints to be clarified later.
If splitting is possible, Tailor removes G0 from P, and inserts G1 and G2 in P. This
decreases the perimeter of P. After that, Tailor recursively splits a QI-group in
P, until no QI-group can be divided further, i.e., the perimeter of P has reached a
local minimum. Then, Tailor terminates, and outputs the anonymization decided
by P and an anonymization function (e.g., the MBR function).

Whenever Tailor divides a QI-group G into subsets Ga and Gb, {Ga, Gb} must
be an l-cut:

Definition 9 (l-Cut). Let G be a QI-group, l be a positive integer, and c be
the maximum number of tuples in G with the same sensitive value. An l-cut of G
on A^q_i (i ∈ [1, d]) is an ordered set {Ga, Gb} of QI-groups, such that:

(1) Ga ∪ Gb = G, and Ga ∩ Gb = ∅.
(2) |Ga| ≥ l·c and |Gb| ≥ l·c.
(3) For any ta ∈ Ga and tb ∈ Gb, either (i) ta[A^q_i] < tb[A^q_i], or (ii) ta[A^q_i] = tb[A^q_i]
and ta[A^id] < tb[A^id].

The perimeter of the l-cut is the total perimeter of Ga and Gb.

Condition 2 in Definition 9 implies that G (on which the l-cut is performed) is
2l-diverse. Condition 3 requires, intuitively, that all tuples in Ga must precede
those in Gb, along the dimension A^q_i on which G is divided.

Interestingly, as long as G is 2l-diverse, there exists at least one l-cut on any
QI-attribute A^q_i (i ∈ [1, d]). Such a cut can be found as follows. First, we sort the
tuples in G in ascending order of their A^q_i values. In case two tuples have the same
value on A^q_i, the tuple with a smaller identifier precedes the other. Then, we create
Ga by including the first k tuples in the sorted sequence (for any k ∈ [l·c, |G| − l·c]),
and construct Gb using the remaining tuples.
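The construction just described can be sketched as follows. The sketch enumerates every l-cut of a group; using names as tuple identifiers and encoding a tuple as (identifier, QI values) are assumptions of this example, and the data are the QI values of T5 in Table VIII.

```python
def l_cuts(group, l, c, d):
    """Enumerate all l-cuts of a QI-group.

    group -- list of (identifier, qi_values) pairs
    c     -- maximum number of tuples in the group sharing a sensitive value
    d     -- number of QI attributes
    Returns a list of (dimension, Ga, Gb) triples.
    """
    cuts = []
    for i in range(d):
        # Sort on attribute i, breaking ties by identifier.
        ordered = sorted(group, key=lambda t: (t[1][i], t[0]))
        for k in range(l * c, len(ordered) - l * c + 1):
            cuts.append((i, ordered[:k], ordered[k:]))
    return cuts

T5_QI = [("Ann", (21, 10000)), ("Bob", (27, 18000)), ("Gate", (32, 35000)),
         ("Don", (32, 35000)), ("Ed", (54, 60000)), ("Fred", (60, 63000)),
         ("Gill", (60, 63000)), ("Hera", (60, 63000))]

cuts = l_cuts(T5_QI, l=2, c=2, d=2)
```

For T5 with l = 2 and c = 2, each dimension admits a single value of k (namely 4), so the enumeration yields d · (|G| + 1 − 2l·c) = 2 cuts, both separating {Ann, Bob, Gate, Don} from the rest.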

The above strategy yields in total d · (|G| + 1 − 2l·c) different l-cuts. Among them,
Tailor always selects the canonical one:

Definition 10 (Canonical l-Cut). The canonical l-cut of a QI-group G is
the l-cut with the smallest perimeter. In case multiple l-cuts have the smallest
perimeter, the canonical l-cut {Ga, Gb} is uniquely decided as follows. Assume
{Ga, Gb} is on dimension A^q_i (i ∈ [1, d]); then:

(1) No l-cut on any A^q_j (j < i) has the same perimeter as {Ga, Gb}.

(2) For any l-cut {G'a, G'b} on A^q_i, if {G'a, G'b} and {Ga, Gb} have the same
perimeter, it must hold that |Ga| < |G'a|.

Algorithm Tailor (T, l)

1. if T is not l-eligible then return ∅

2. G0 = a QI-group containing all tuples in T, and P = {G0}

3. while there exists a 2l-diverse QI-group G in P

4.     {Ga, Gb} = the canonical l-cut of G

5.     P = P − {G} + {Ga, Gb}

6. return the anonymization decided by P and an anonymization function



Fig. 3. The Tailor algorithm
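For concreteness, the pseudo-code of Figure 3 could be realized along the following lines. This is a sketch under several assumptions, not the authors' implementation: tuples are dictionaries with hypothetical keys `id`, `qi`, and `s`; names serve as identifiers; the perimeter is normalized by each attribute's range in the input table; and the function returns the final partition rather than the generalized table.

```python
from collections import Counter
from fractions import Fraction

def max_freq(group):
    """c: the largest number of tuples sharing one sensitive value."""
    return max(Counter(t["s"] for t in group).values())

def perimeter(group, extents):
    spread = sum(
        Fraction(max(t["qi"][i] for t in group) - min(t["qi"][i] for t in group), e)
        for i, e in enumerate(extents) if e
    )
    return len(group) * spread

def canonical_cut(group, l, extents):
    """Smallest perimeter; ties go to the lowest dimension, then smallest |Ga|."""
    c, best = max_freq(group), None
    for i in range(len(extents)):
        ordered = sorted(group, key=lambda t: (t["qi"][i], t["id"]))
        for k in range(l * c, len(group) - l * c + 1):
            ga, gb = ordered[:k], ordered[k:]
            key = (perimeter(ga, extents) + perimeter(gb, extents), i, k)
            if best is None or key < best[0]:
                best = (key, ga, gb)
    return best[1], best[2]

def tailor(table, l):
    if max_freq(table) * l > len(table):
        return None                      # T is not l-eligible
    d = len(table[0]["qi"])
    extents = [max(t["qi"][i] for t in table) - min(t["qi"][i] for t in table)
               for i in range(d)]
    done, todo = [], [list(table)]
    while todo:
        g = todo.pop()
        if max_freq(g) * 2 * l <= len(g):        # g is 2l-diverse: split it
            todo.extend(canonical_cut(g, l, extents))
        else:
            done.append(g)
    return done

def row(name, age, zipcode, disease):
    return {"id": name, "qi": (age, zipcode), "s": disease}

T5 = [row("Ann", 21, 10000, "dyspepsia"), row("Bob", 27, 18000, "flu"),
      row("Gate", 32, 35000, "gastritis"), row("Don", 32, 35000, "gastritis"),
      row("Ed", 54, 60000, "flu"), row("Fred", 60, 63000, "bronchitis"),
      row("Gill", 60, 63000, "dyspepsia"), row("Hera", 60, 63000, "diabetes")]

partition = tailor(T5, 2)
```

Note that `canonical_cut` consults the sensitive values only through the count c, which is the property the transparency argument relies on.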



Name   Age   Zipcode   Disease
Ann    21    10000     dyspepsia
Bob    27    18000     flu
Gate   32    35000     gastritis
Don    32    35000     gastritis
Ed     54    60000     flu
Fred   60    63000     bronchitis
Gill   60    63000     dyspepsia
Hera   60    63000     diabetes

Table VIII. Microdata T5



Age        Zipcode      Disease
[21, 32]   [10k, 35k]   dyspepsia
[21, 32]   [10k, 35k]   flu
[21, 32]   [10k, 35k]   gastritis
[21, 32]   [10k, 35k]   gastritis
[54, 60]   [60k, 63k]   flu
[54, 60]   [60k, 63k]   bronchitis
60         63000        dyspepsia
60         63000        diabetes

Table IX. Generalization T6*



Note that the canonical l-cut of a QI-group G is determined by (i) the identifiers
and QI values in G, as well as (ii) the maximum number c of tuples in G with the
same sensitive value; all of this information is independent of the concrete sensitive
value of any particular tuple. This property is the key to ensuring transparent
l-diversity, as will be discussed in Section 3.1.2.

Figure 3 shows the pseudo-code of Tailor. We demonstrate the algorithm with 
an example, assuming that the MBR function is adopted. 

Example 4. Let us use Tailor to obtain a transparently 2-diverse generalization
of the microdata in Table VIII (i.e., T = T5 and l = 2). Tailor first verifies that
T5 is 2-eligible (Line 1 in Figure 3), and then initializes a partition P = {G0},
where G0 = T5 (Line 2). The subsequent execution of Tailor is in iterations (Lines
3-5). In each iteration, Tailor looks for a 4 (= 2l)-diverse QI-group G in P (Line
3). If G does not exist, Tailor terminates, and returns the generalization decided
by P (Line 6). Otherwise, Tailor splits G using its canonical l-cut (Lines 4-5), and
replaces G with the new QI-groups.

Specifically, in the first iteration, the only QI-group G0 in P is 4-diverse, and
hence, is chosen to be split. Tailor identifies c = 2, which, as in Definition 9, is
the largest number of tuples in G0 having the same sensitive value. Then, Tailor
proceeds to find the canonical 2-cut of G0. For this purpose, it needs to obtain the
best 2-cut (with the smallest perimeter) along every dimension. Dealing with Age
first, Tailor sorts the tuples in G0 by their Age values, and tries all possibilities of
dividing the sorted list into two parts, each with at least 4 (= 2c) tuples (required
by condition 2 in Definition 9). There is only one possibility: {G2, G3}, where G2 =
{Ann, Bob, Gate, Don}, and G3 = {Ed, Fred, Gill, Hera}. Hence, {G2, G3} is the
best 2-cut on Age. Switching to dimension Zipcode, Tailor sorts the tuples in G0 by
their Zipcode values, and again, attempts all division possibilities. Again, {G2, G3}
is the only possibility, and hence, is also the best 2-cut on Zipcode. Hence, {G2, G3}
is the canonical 2-cut. Tailor thus replaces G0 with G2 and G3 in P.

In the second iteration, P = {G2, G3}. As G2 is not 4-diverse, it cannot be
split. But G3 is 4-diverse, and thus, is split using its canonical cut {G4, G5}, where
G4 = {Ed, Fred} and G5 = {Gill, Hera}. Now, P becomes {G2, G4, G5}. Since
no QI-group is 4-diverse, Tailor returns the generalization T6* determined by P, as
shown in Table IX. □

Tailor is deterministic, i.e., for any T, l, and T*, Pr{Tailor(T, l) = T*} (see
Proposition 1) equals either 0 or 1. In addition, Tailor has an O(n^2 log n) time
complexity, where n is the number of tuples in T. This follows from the facts that
(i) Tailor performs at most n/l l-cuts on T, and (ii) each l-cut takes O(n log n)
time.

3.1.2 Proof of Transparent l-Diversity. In this section, we will prove that Tailor
ensures transparent l-diversity. The core of our proof is an analysis on the set S
of all possible microdata instances based on the adversary's external source E. We
will show that S can be divided into several subsets, such that for each subset
S_sub, (i) all instances in S_sub can be transformed to the same anonymization T* by
Tailor, and (ii) each individual in E is assigned many different sensitive values in
different instances in S_sub. Intuitively, when the adversary observes T*, s/he would
not be able to infer which instance in S_sub is the real microdata, and hence, the
sensitive value of each individual can be concealed.

More specifically, our analysis exploits the isomorphism between partitions. We
say that a partition P1 of a possible microdata instance is isomorphic to a partition
P2 of another instance, if and only if each QI-group in P1 is isomorphic to a QI-group
in P2, and vice versa (see Section 2.1 for the definition of QI-group isomorphism).

Example 5. Consider the partition P of T5 (in Table VIII) generated by Tailor
in Example 4. P contains three QI-groups, namely, G2 = {Ann, Bob, Gate, Don},
G4 = {Ed, Fred}, and G5 = {Gill, Hera}. The sensitive values of Ed and Fred are
flu and bronchitis, respectively. Suppose that we modify the two tuples in G4 by
swapping their Disease values, such that Ed has bronchitis and Fred has flu. The
resulting QI-group G'4 is isomorphic to G4, while the partition P' = {G2, G'4, G5}
is isomorphic to P. Note that P' is not a partition of T5, but is in fact a partition of
the microdata T3 in Table IV (this will be useful in demonstrating Lemma 3 later).

□

Recall that, for any anonymization function f and any two isomorphic QI-groups
G1 and G2, we have f(G1) = f(G2) (see Definition 1). Therefore, once f is fixed,
isomorphic partitions always lead to the same anonymization. For instance,
consider the partitions P and P' in Example 5. Notice that P' and the MBR function
decide T6* (in Table IX), which is determined by P and the MBR function as well.
In addition, isomorphic QI-groups have a crucial property:

Lemma 2. Let G and G' be two isomorphic QI-groups, and {G1, G2} ({G'1, G'2})
be the canonical l-cut of G (G'). Then, G1 and G'1 (G2 and G'2) must involve the
same set of individuals.

The above lemma is fairly intuitive. Recall that the canonical l-cut of a QI-group
G depends only on the identifiers and QI values in G, and is independent of
the sensitive values. Since isomorphic QI-groups contain equivalent identifiers and
QI values, their canonical l-cuts divide them in the same way, and thus Lemma 2
holds. Based on Lemma 2, we derive the following result, which shows an important
characteristic of Tailor.

Lemma 3. Let T1 be a microdata table, l be an integer, and T* = Tailor(T1, l).
Let P1 be the partition of T1 that decides T*, P2 be a partition isomorphic to P1,
and T2 = ∪_{G ∈ P2} G. Then, Tailor(T2, l) = T*, and P2 is the partition of T2 that
decides T*.

For instance, consider the microdata T3, T5 and the partitions P, P' in Example 5.
We have shown in Example 4 that Tailor(T5, 2) = T6*, where T6* is decided by P.
Recall that P' is isomorphic to P, and T3 = ∪_{G ∈ P'} G. According to Lemma 3, we
have Tailor(T3, 2) = T6*, i.e., given l = 2, Tailor transforms both T3 and T5 into T6*.

The following theorem shows a sufficient condition for transparent l-diversity.

Theorem 1. An l-diversity algorithm Q is transparent if it satisfies the following
condition: For any microdata T1 such that Q(T1, l) = T*, we have Q(T2, l) = T*
for a microdata table T2, if T2 has a partition isomorphic to the partition of T1 that
decides T*.

By Lemma 3, Tailor satisfies the sufficient condition in Theorem 1, which proves 
that Tailor is a transparent algorithm. 

3.2 The Ace Algorithm 

This section discusses another algorithm, Ace (assign and slice), which first
appeared in [Xiao and Tao 2007] as part of a solution to anonymizing dynamic
datasets. Here, we present non-trivial proofs on the privacy guarantee of Ace against 
adversaries who have full knowledge of the algorithm. 

3.2.1 Algorithm Description. Let us first introduce several concepts. Given a
QI-group B, we define the signature of B as the set of sensitive values in B. A
column of B refers to a maximal set of tuples in B with the same sensitive value.
B is a bucket, if all of its columns contain an equal number of tuples. A partition
U is a bucket partition, if each QI-group in U is a bucket.

For example, consider a QI-group B1 of the microdata in Table VIII, where
B1 = {Ann, Bob, Ed, Gill}. The signature of B1 is {dyspepsia, flu}. B1 contains
two columns, C1 = {Ann, Gill} and C2 = {Bob, Ed}, where all tuples in C1 (C2)
have sensitive value dyspepsia (flu). Since |C1| = |C2|, B1 is a bucket. Let B2 and
B3 be another two QI-groups of T5, such that B2 = {Don, Fred} and B3 = {Gate,
Hera}. It can be verified that B2 and B3 are also buckets. Therefore, the partition
U1 = {B1, B2, B3} is a bucket partition of T5. Figure 4 illustrates U1.
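The notions of signature, column, and bucket translate directly into code. The sketch below is illustrative; the function names and the encoding of a tuple as a (name, sensitive value) pair are assumptions of this example, and the data come from Table VIII.

```python
from collections import defaultdict

def columns(group):
    """Map each sensitive value to the individuals holding it."""
    cols = defaultdict(list)
    for name, value in group:
        cols[value].append(name)
    return dict(cols)

def signature(group):
    """The set of sensitive values appearing in the QI-group."""
    return set(columns(group))

def is_bucket(group):
    """A QI-group is a bucket if all of its columns have equal size."""
    return len({len(members) for members in columns(group).values()}) == 1

B1 = [("Ann", "dyspepsia"), ("Bob", "flu"), ("Ed", "flu"), ("Gill", "dyspepsia")]
B2 = [("Don", "gastritis"), ("Fred", "bronchitis")]
B3 = [("Gate", "gastritis"), ("Hera", "diabetes")]
not_a_bucket = [("Ann", "dyspepsia"), ("Bob", "flu"), ("Ed", "flu")]
```

All three groups of the example pass the check, whereas `not_a_bucket` fails because its flu column is larger than its dyspepsia column.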

Apparently, U1 is 2-diverse. Suppose that we divide B1 into two smaller buckets,
B4 = {Ann, Bob} and B5 = {Gill, Ed}, both having the same signature as B1. The
partition U'1 = {B2, B3, B4, B5} is also 2-diverse, and has a lower perimeter than
U1. In general, given any l-diverse bucket partition U, we may reduce its perimeter
by splitting the buckets in U, without violating l-diversity. This strategy is adopted

Bucket   Column (sensitive value)   Tuples
B1       dyspepsia                  Ann, Gill
B1       flu                        Bob, Ed
B2       gastritis                  Don
B2       bronchitis                 Fred
B3       diabetes                   Hera
B3       gastritis                  Gate

Fig. 4. Bucket Partition U1




by Ace. In particular, whenever Ace splits a bucket B, the resulting sub-buckets
always constitute a division of B, as defined below:

Definition 11 (Division). A division of a bucket B on A^q_i (i ∈ [1, d]) is an
ordered set {Ba, Bb} of buckets, such that:

(1) Ba ∪ Bb = B, and Ba ∩ Bb = ∅.

(2) B, Ba and Bb have an identical signature.

(3) For any two tuples ta ∈ Ba and tb ∈ Bb with the same sensitive value, we have
either (i) ta[A^q_i] < tb[A^q_i], or (ii) ta[A^q_i] = tb[A^q_i] and ta[A^id] < tb[A^id].

The perimeter of the division equals the perimeter of {Ba, Bb}. A bucket is
divisible, if each of its columns has at least two tuples.

Given a bucket B with x columns, we can obtain a division {Ba, Bb} of B on A^q_i
(i ∈ [1, d]) as follows. First, we sort the tuples in each column of B in ascending
order of their A^q_i values. Whenever two tuples have an identical value on A^q_i,
the tuple with a smaller identifier precedes the other. This results in x sorted
sequences. To construct Ba, we can remove an equal number of tuples from the
top of each sequence, and insert them into Ba. After that, Bb can be formed using
the remaining tuples.
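The construction of a division can be sketched as follows. The function `divide`, its parameter k (the number of tuples taken from the top of each column), and the use of names as identifiers are assumptions made for this illustration.

```python
def divide(bucket, i, k):
    """Build a division {Ba, Bb} of a bucket on QI attribute i.

    bucket -- list of (identifier, qi_values, sensitive_value) triples
    k      -- number of tuples moved from the top of each column into Ba
    """
    cols = {}
    for t in bucket:
        cols.setdefault(t[2], []).append(t)
    ba, bb = [], []
    for col in cols.values():
        if not 0 < k < len(col):
            raise ValueError("each column must keep tuples on both sides")
        # Sort the column on attribute i, breaking ties by identifier.
        col.sort(key=lambda t: (t[1][i], t[0]))
        ba.extend(col[:k])
        bb.extend(col[k:])
    return ba, bb

B1 = [("Ann", (21, 10000), "dyspepsia"), ("Bob", (27, 18000), "flu"),
      ("Ed", (54, 60000), "flu"), ("Gill", (60, 63000), "dyspepsia")]

Ba, Bb = divide(B1, i=0, k=1)   # divide on Age
```

On the bucket B1 of Table VIII, this reproduces the sub-buckets {Ann, Bob} and {Gill, Ed} mentioned earlier, both of which keep the signature {dyspepsia, flu}.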

A bucket may have multiple divisions. In a way similar to canonical l-cuts, we
formulate canonical division as:

Definition 12 (Canonical Division). The canonical division of a bucket
B is the division with the smallest perimeter. In case multiple divisions have the
smallest perimeter, the canonical division {Ba, Bb} is uniquely decided as follows.
Assume {Ba, Bb} is on dimension A^q_i (i ∈ [1, d]); then:

(1) No division on any A^q_j (j < i) has the same perimeter as {Ba, Bb}.

(2) For any division {B'a, B'b} on A^q_i, if {Ba, Bb} and {B'a, B'b} have the same
perimeter, it must hold that |Ba| < |B'a|.

As with canonical l-cuts, the canonical division of a bucket B is irrelevant to the
sensitive values in B. Instead, it is decided only by the identifiers and QI values
in each column. In Section 3.2.2, we will exploit this property to prove that Ace is
transparent.

Figure 5 illustrates the pseudo-code of Ace. Given a microdata T and a positive
integer l, Ace first verifies whether T is l-eligible. After that, it invokes a subroutine
Assign (in Figure 6) to construct an l-diverse bucket partition U of T. Next, Ace
employs the Slice algorithm (in Figure 8) to split the buckets in U, and obtains a
refined partition U' of T. In particular, the construction of U is performed without
inspecting the QI values of the tuples, while the split of each bucket in U is based

Algorithm Ace (T, l)

1. if T is not l-eligible then return ∅

2. U = Assign(T, l)

3. U' = Slice(U)

4. return the generalization decided by U' and an anonymization function

Fig. 5. The Ace algorithm

on canonical divisions, which are independent of the sensitive value in each column 
of the bucket. In other words, Assign and Slice do not rely on the correlations 
between the QI and sensitive values, which helps achieve transparent l-diversity.
Finally, Ace returns the generalization decided by U' . In the following, we explain 
the details of Ace with an example, assuming that the MBR function is adopted. 

Example 6. Assume that we apply Ace on the microdata T5 in Table VIII, with
l = 2. Ace begins by checking whether T5 is l-eligible. Since T5 is 2-eligible, Ace
invokes Assign to construct a bucket partition U of T5.

Assign first sets U = ∅, and creates a set S_t containing all tuples in T5 (Lines
1-3 in Figure 6). After that, Assign iteratively removes tuples from S_t to construct
buckets in U, until S_t is empty (Lines 4-13). In each iteration, Assign first counts
the frequency of each sensitive value in S_t (Lines 5-6), and then builds a bucket
B, such that (i) the signature of B consists of the β most frequent sensitive values
in S_t, and (ii) each column of B contains α tuples in S_t. The values of α and β
are decided in Lines 7-10, which, as explained in [Xiao and Tao 2007], guarantee
that (i) β ≥ l, (ii) α ≥ 1, and (iii) Assign always terminates^. For our discussion,
it suffices to know that α and β depend only on the size of S_t and the sensitive
values in S_t. Since β ≥ l, any bucket B created by Assign is l-diverse.

In the first iteration, S_t = T5, and α = β = 2 (calculated by Lines 7-10).
Figure 7(a) illustrates the tuples in S_t. Assign first creates a bucket B1 whose
signature consists of the β = 2 most frequent sensitive values in S_t. As shown in
Figure 7(a), there exist three sensitive values in S_t, dyspepsia, flu, and gastritis, that
have the same highest frequency. To pick two of the three diseases, Assign resorts
to a total ordering. In general, any total ordering works, but for our illustration, we
use the alphabetic order, in which case the signature of B1 is selected as {dyspepsia,
flu}. Next, for each disease in the signature, Assign adds α = 2 tuples to B1. As
a consequence, B1 contains four tuples {Ann, Bob, Gill, Ed}, as illustrated in
Figure 4. The tuples in B1 are then removed from S_t.

In the second iteration, S_T contains four tuples, as shown in Figure 7(b). This 
time, α = 1 and β = 2. Hence, Assign yields a bucket B2 with signature {gastritis, 
bronchitis} (gastritis is picked as it has the highest frequency in S_T; bronchitis is 
chosen because it alphabetically ranks before diabetes). Accordingly, Assign inserts 
two tuples into B2: one with a sensitive value gastritis, and the other one with 
bronchitis. As there are two tuples having gastritis, the one to appear in B2 is 
randomly chosen; suppose that we pick Don. This leads to B2 = {Don, Fred}, 



Footnote: Intuitively, Assign always terminates, because (i) each iteration of Assign removes α·β > 0 tuples 
from S_T, and hence, (ii) S_T will become empty after a certain number of iterations, in which case 
Assign stops by returning the bucket partition U it constructs (see Lines 3 and 13 in Figure 6). 

Algorithm Assign (T, l) 

1.  initialize a partition U = ∅ 
2.  m = the number of distinct A^s values in T 
3.  S_T = T 
    /* The tuples in S_T will be iteratively removed to construct buckets in U */ 
4.  while S_T ≠ ∅ 
    /* Lines 5-12 create a new bucket in U using tuples from S_T */ 
5.     let v_i (i ∈ [1, m]) be the i-th most frequent A^s value in the current S_T 
       /* Ties are resolved by a total ordering on A^s (see Example 6) */ 
6.     let n_i (i ∈ [1, m]) be the number of tuples in S_T with sensitive value v_i 
7.     β = l 
       /* the new bucket's signature will contain the β most frequent A^s values in S_T */ 
8.     α = the largest positive integer satisfying three inequalities: 
          α ≤ n_β,  n_1 - α ≤ (|S_T| - α·β)/l,  and  n_{β+1} ≤ (|S_T| - α·β)/l 
       /* the new bucket will contain α tuples for each sensitive value in its signature */ 
8.     if α does not exist 
9.        β = β + 1; goto Line 7 
10.    create in U a bucket B with a signature {v_1, ..., v_β} 
11.    for i = 1 to β 
12.       from S_T, randomly remove α tuples whose sensitive values equal v_i, and insert 
          those tuples into B 
13. return U 

Fig. 6. The Assign algorithm 



(a) Before B1 is constructed 
    Tuples in S_T: Fred (bronchitis); Hera (diabetes); Ann, Gill (dyspepsia); Bob, Ed (flu); Gate, Don (gastritis) 

(b) Before B2 is constructed 
    Tuples in S_T: Fred (bronchitis); Hera (diabetes); Gate, Don (gastritis) 

(c) Before B3 is constructed 
    Tuples in S_T: Hera (diabetes); Gate (gastritis) 

Fig. 7. Changes in S_T During the Execution of Assign in Example 6 



Algorithm Slice (U) 

1. while there exists a divisible bucket B in U 
2.    {B_a, B_b} = the canonical division of B 
3.    U = U - {B} + {B_a, B_b} 
4. return U 

Fig. 8. The Slice algorithm 



as illustrated in Figure 4. Don and Fred are then evicted from S_T, as shown in 
Figure 7(c). 

Similarly, the third iteration constructs a bucket B3 = {Hera, Gate} (see Figure 4). 
Then, S_T becomes empty, and hence, Assign terminates with a bucket 
partition U = {B1, B2, B3}. 
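
The walk-through of Assign above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the function name `assign`, the data layout, and the `rng` parameter are hypothetical, ties are broken alphabetically as in Example 6, and the two inequalities that bound α are an assumption (they require the tuples left in S_T to remain l-eligible). When no α exists, the sketch simply increases β and retries.

```python
import random
from collections import defaultdict


def assign(tuples, l, rng=None):
    """Sketch of Assign; `tuples` is a list of (name, sensitive_value) pairs.

    Returns a list of buckets. Each bucket maps a sensitive value of its
    signature to the list of names in the corresponding column.
    """
    rng = rng or random.Random()
    remaining = defaultdict(list)          # S_T, grouped by sensitive value
    for name, value in tuples:
        remaining[value].append(name)
    size = len(tuples)                     # |S_T|
    partition = []                         # U

    while size > 0:
        # Lines 5-6: rank the values by frequency, ties broken alphabetically.
        ranked = sorted((v for v in remaining if remaining[v]),
                        key=lambda v: (-len(remaining[v]), v))
        counts = [len(remaining[v]) for v in ranked]

        # Lines 7-9: smallest beta >= l for which a valid alpha exists.
        beta, alpha = l, None
        while alpha is None:
            if beta > len(ranked):
                raise ValueError("input is not l-eligible")
            n_next = counts[beta] if beta < len(counts) else 0
            for a in range(counts[beta - 1], 0, -1):   # alpha <= n_beta
                rest = size - a * beta
                # Assumed conditions: the leftover tuples stay l-eligible.
                if l * (counts[0] - a) <= rest and l * n_next <= rest:
                    alpha = a
                    break
            else:
                beta += 1

        # Lines 10-12: move alpha random tuples per signature value into B.
        bucket = {}
        for v in ranked[:beta]:
            chosen = rng.sample(remaining[v], alpha)
            for name in chosen:
                remaining[v].remove(name)
            bucket[v] = chosen
        size -= alpha * beta
        partition.append(bucket)
    return partition
```

On the eight tuples of Figure 7(a) with l = 2, the sketch produces three buckets whose signatures are {dyspepsia, flu}, {gastritis, bronchitis}, and {diabetes, gastritis}, in line with Example 6; which gastritis tuple joins the second bucket depends on the random choice.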

As the second step, Ace applies Slice to divide the buckets in U into smaller QI-groups. 
Slice also runs in iterations. In each iteration, it first identifies a divisible 
bucket B in U (Line 1 in Figure 8), and then splits B using its canonical division 
{B_a, B_b}. This is repeated until no bucket in U is divisible. 

Age        Zipcode       Disease 
[21, 27]   [10k, 18k]    dyspepsia 
[21, 27]   [10k, 18k]    flu 
[54, 60]   [60k, 63k]    dyspepsia 
[54, 60]   [60k, 63k]    flu 
[32, 60]   [35k, 63k]    gastritis 
[32, 60]   [35k, 63k]    bronchitis 
[32, 60]   [35k, 63k]    diabetes 
[32, 60]   [35k, 63k]    gastritis 

Table X. Generalization 

Name   Age   Zipcode   Disease 
Ann    21    10000     flu 
Bob    27    18000     dyspepsia 
Gate   32    35000     gastritis 
Don    32    35000     gastritis 
Ed     54    60000     dyspepsia 
Fred   60    63000     bronchitis 
Gill   60    63000     flu 
Hera   60    63000     diabetes 

Table XI. Microdata Tg 

In our example, the input to Slice is the bucket partition U = {B1, B2, B3} in 
Figure 4. B1 is the only divisible bucket. To determine the canonical division of B1, 
Slice finds the best division on each dimension (with the lowest perimeter). It turns 
out that, on both dimensions Age and Zipcode, the best division is {B4, B5}, where 
B4 = {Ann, Bob} and B5 = {Gill, Ed}. Thus, {B4, B5} becomes the canonical 
division. Therefore, Slice removes B1 from U, and inserts B4 and B5 instead, 
leading to U = {B2, B3, B4, B5}. As no bucket in U is divisible, Slice returns U to 
Ace. Finally, Ace reports the generalization of T5 (in Table X) decided by U. □ 

Ace is a randomized algorithm, due to the randomness in its component Assign. 
Furthermore, Ace has an O(n^2 log n) time complexity, where n is the number of 
tuples in T. To understand this, observe that Assign runs in O(n) time (we regard 
the number of distinct A^s values in T as a constant). On the other hand, Slice 
has an O(n^2 log n) time complexity, since (i) each bucket B generated from Assign 
is divided by Slice exactly |B|/l times, (ii) each division of B incurs O(|B| log |B|) 
overhead, and (iii) the sizes of all buckets add up to n. Since Ace is a composition 
of Assign and Slice, its time complexity is O(n^2 log n). 

3.2.2 Proof of Transparent l-Diversity. This section proves that Ace achieves 
transparent l-diversity. Our analysis utilizes a crucial concept, the symmetry 
between buckets. 

Definition 13 (Symmetry). Two buckets B1 and B2 are symmetric, if and 
only if (i) B1 and B2 have the same signature, and (ii) for any column L1 ⊆ 
B1, there exists a column L2 ⊆ B2, such that L1 and L2 involve the same set of 
individuals. Two bucket partitions U1 and U2 are symmetric, if each bucket in U1 
is symmetric to a bucket in U2, and vice versa. 
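
Definition 13 can be checked mechanically. The sketch below is an illustration under assumptions, not part of the original algorithms: a bucket is modelled as a dictionary from each sensitive value in its signature to the set of individuals in that column, and the function names are hypothetical. The sample data uses the buckets of Example 6, with the sensitive values of the two columns of B1 swapped to form a symmetric counterpart.

```python
def buckets_symmetric(b1, b2):
    """True if b1 and b2 satisfy both conditions of Definition 13."""
    same_signature = set(b1) == set(b2)
    cols1 = {frozenset(col) for col in b1.values()}
    cols2 = {frozenset(col) for col in b2.values()}
    # Every column of b1 must cover the same individuals as some column of b2.
    return same_signature and cols1 <= cols2


def partitions_symmetric(u1, u2):
    """True if each bucket of u1 is symmetric to a bucket of u2, and vice versa."""
    forward = all(any(buckets_symmetric(a, b) for b in u2) for a in u1)
    backward = all(any(buckets_symmetric(b, a) for a in u1) for b in u2)
    return forward and backward


# Buckets of Example 6 (individuals identified by name).
B1 = {"dyspepsia": {"Ann", "Gill"}, "flu": {"Bob", "Ed"}}
B2 = {"gastritis": {"Don"}, "bronchitis": {"Fred"}}
B3 = {"diabetes": {"Hera"}, "gastritis": {"Gate"}}

# Swapping the sensitive values between the two columns of B1.
B1_swapped = {"flu": {"Ann", "Gill"}, "dyspepsia": {"Bob", "Ed"}}

# Regrouping the individuals into different columns breaks the symmetry.
B1_regrouped = {"dyspepsia": {"Ann", "Bob"}, "flu": {"Gill", "Ed"}}
```

Here `buckets_symmetric(B1, B1_swapped)` holds, whereas `buckets_symmetric(B1, B1_regrouped)` does not, because the columns of the regrouped bucket involve different sets of individuals.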

Consider, for example, the bucket partition U1 in Figure 4. Bucket B1 ∈ U1 
contains two columns L1 = {Ann, Gill} and L2 = {Bob, Ed}. Suppose that we 
exchange the sensitive values between L1 and L2, by setting the sensitive values of 
the tuples in L1 (L2) to flu (dyspepsia). Then, we obtain a bucket B1' symmetric 
to B1, as shown in Figure 9. The bucket partition U2 = {B1', B2, B3} is symmetric 
to U1. In general, we can obtain any symmetric counterpart of a bucket B, by 
swapping the sensitive values between different columns of B. 

Fig. 9. Bucket partition U2 



Interestingly, the canonical division of a symmetric bucket always results in 
symmetric sub-buckets: 

Lemma 4. Let B and B' be two symmetric buckets, and {B1, B2} ({B1', B2'}) be 
the canonical division of B (B'). Then, B1 and B1' (B2 and B2') are symmetric. 

The rationale behind Lemma 4 is similar to that of Lemma 2. Specifically, since 
B and B' are symmetric, each column L in B can be mapped to a column L' in B', 
such that L and L' involve an identical set of identifiers and QI values. Recall that 
the canonical division of a bucket depends only on identifiers and QI values, and 
is irrelevant to sensitive values. Hence, the canonical division of B has the same 
effect as that of B', thus establishing Lemma 4. The lemma naturally leads to the 
following result. 

Lemma 5. Let U1 and U2 be two symmetric bucket partitions. Let U1' = Slice(U1) 
and U2' = Slice(U2). Then, U1' and U2' are symmetric. 

Assign also has an interesting property related to symmetric buckets: 

Lemma 6. Let T1 be a microdata table, l an integer, and U1 a possible output of 
Assign(T1, l). Let U2 be a bucket partition symmetric to U1, and T2 = ∪_{B ∈ U2} B. 
Then, Pr{Assign(T1, l) = U1} = Pr{Assign(T2, l) = U2}. 

For instance, consider the symmetric bucket partitions U1 and U2 in Figures 4 
and 9, respectively. U1 (U2) is a partition of the microdata T5 in Table VIII (the 
microdata in Table XI). By Lemma 6, the probability that Assign returns U1 on T5 
(with l = 2) equals the probability that it outputs U2 on the microdata in Table XI. 

We prove that Ace ensures transparent l-diversity by combining Lemmas 5 and 
6 with the following theorem, which states a sufficient condition for transparent 
l-diversity. 

Theorem 2. Let G_A and G_B be two algorithms as follows: 

(1) G_A takes as input a microdata table T1 and a positive integer l, and outputs a 
bucket partition U1 of T1, such that for any bucket partition U2 symmetric to 
U1, we have Pr{G_A(T1, l) = U1} = Pr{G_A(T2, l) = U2}, where T2 = ∪_{B ∈ U2} B. 

(2) G_B is a deterministic algorithm that takes as input a bucket partition U and 
outputs another bucket partition, such that for any bucket partition U' symmetric 
to U, G_B(U) is always symmetric to G_B(U'). 

Let G be an l-diversity algorithm that first applies G_A on the input microdata, 
then invokes G_B on the bucket partition output from G_A, and finally returns the 
anonymization decided by the bucket partition generated from G_B. G is transparent. 

Algorithm Hybrid (T, l) 

1.  if T is not l-eligible then return 
2.  G0 = a QI-group containing all tuples in T, and P = {G0} 
3.  while there exists a 2l-diverse QI-group G in P 
4.     {G_a, G_b} = the canonical l-cut of G 
5.     P = P - {G} + {G_a, G_b} 
6.  T* = ∅ 
7.  for each QI-group G_i ∈ P 
8.     T_i* = Ace(G_i, l) 
9.     T* = T* ∪ T_i* 
10. return T* 

Fig. 10. The Hybrid algorithm 

By Lemma 6 (Lemma 5), Assign (Slice) satisfies the requirements for G_A (G_B) 
stated in Theorem 2; therefore, Ace (as a combination of Assign and Slice) is a 
transparent algorithm. 

3.3 The Hybrid Algorithm 

This section develops a new algorithm Hybrid that combines Tailor and Ace. Hybrid 
is motivated by, and overcomes the drawbacks of, Tailor and Ace. We will first 
explain those drawbacks, and then elaborate the details of Hybrid. 

Given a microdata T and an integer l, Tailor initiates a partition P = {T}, and 
then iteratively refines P, by splitting the QI-groups of P into smaller ones. However, 
once a QI-group violates 2l-diversity, it is ignored by Tailor, even if it can be 
further divided. As a result, Tailor sometimes spawns QI-groups with many tuples, 
entailing high information loss. For example, consider the 2-diverse generalization 
in Table IX, which is produced by Tailor in Example 4. The first QI-group G1 
in this generalization has four tuples {Ann, Bob, Gate, Don} in T5 (Table VIII). In fact, G1 can 
be further split into 2-diverse QI-groups {Ann, Gate} and {Bob, Don}. Tailor fails 
to see the split because G1 is not 4-diverse. 

Ace does not suffer from the above defect, but its random nature may occasionally 
create poor QI-groups. Recall that Ace employs Assign to obtain an l-diverse 
bucket partition U of T. Let us revisit the way Assign builds a bucket B in U: 
Assign first decides the signature of B, and then determines each column in B, using 
tuples randomly selected from T. The distribution of QI values in each column of 
B may vary significantly. For instance, in Example 6, Assign generates a bucket B3 
with signature {diabetes, gastritis}. Diabetes usually affects people over 40, while 
gastritis is common for all ages. Therefore, when Assign constructs the diabetes 
column, the random samples from T are likely to have large Age values. In contrast, 
the gastritis column may contain individuals of any age. 

This (QI-distribution) difference becomes problematic in Slice, which Ace deploys 
to refine the bucket partition U output by Assign. As explained in Section 3.2.1, 
Slice splits each bucket B ∈ U into non-divisible buckets (a.k.a. QI-groups), each 
of which has exactly one tuple from every column of B. If the columns of B 
have diverse QI-distributions, the tuples in a final non-divisible QI-group may have 
dissimilar QI values. After anonymization, such a QI-group would incur large 
information loss. 

Attribute   Age   Gender   Education   Birthplace   Occupation   Income 
Size        79    2        17          57           50           50 

Table XII. Attribute domain sizes 



Hybrid, as in Figure 10, remedies the deficiencies of Tailor and Ace by running 
the two algorithms consecutively. Specifically, Hybrid first computes a partition P 
of T using Tailor. In particular, Lines 1-5 in Figure 10 are identical to Lines 1-5 in 
Figure 3. As the second step, Hybrid treats each QI-group in P as a tiny microdata 
table, and invokes Ace to generalize the QI-group (Lines 6-10). 

By employing Ace to refine P, Hybrid outputs QI-groups with (much) fewer 
tuples than Tailor, thus avoiding the defect of Tailor. Meanwhile, compared to 
Ace, Hybrid incurs lower information loss, by executing Ace on each QI-group in 
P, where tuples already have similar QI values. The following theorem shows that 
Hybrid is transparent. 

Theorem 3. Let T be a microdata table, l be a positive integer, and T* be 
any possible output of Hybrid(T, l). Given any external source E for T, we have 
risk(o) ≤ 1/l for any individual o. 

Finally, we point out that Hybrid has an O(n^2 log n) time complexity, where n 
is the number of tuples in T. This follows from the O(n^2 log n) complexity of both 
Tailor and Ace. 

4. EXPERIMENTS 

In the earlier sections, we have proved the privacy guarantees of our transparent 
algorithms. A natural question is how they compare with the existing solutions 
in terms of data utility and computation overhead (remember that no previous 
solution is transparent, i.e., none ensures anonymity when an adversary 
knows the algorithm details). In the sequel, we answer this question with empirical 
evidence that validates the effectiveness and efficiency of our algorithms. First, 
Section 4.1 clarifies the experiment settings, and then Sections 4.2 and 4.3 present 
detailed results. 

4.1 Experimental Setting 

Following previous work [Ghinita et al. 2007; Xiao and Tao 2007], we employ two 
real-world datasets, OCC and SAL, extracted from the Integrated Public Use 
Microdata Series [Ruggles et al. 2004]. Both datasets consist of 600k tuples, each 
containing the information of an American adult. OCC has a sensitive attribute 
Occupation, and four QI attributes: Age, Gender, Education, and Birthplace. SAL 
has the same QI attributes, but a different sensitive attribute Income. All attributes 
have integer domains. Table XII presents their domain sizes. 

We compare our techniques (adopting the MBR function) against two l-diversity 
generalization algorithms, Mondrian [LeFevre et al. 2006a] and Mask [Wong et al. 
2007]. The former is a popular technique in the literature [Byun et al. 2006; LeFevre 
et al. 2006b; Nergiz et al. 2007; Pei et al. 2007], due to its simplicity and effectiveness. 
Mask, on the other hand, is an existing approach that does not assume 
adversaries with zero algorithm knowledge (nevertheless, as explained in Sections 1 
and 2.3, Mask is not transparent, as it can prevent only minimality attacks). We 
apply each algorithm to compute l-diverse generalizations of OCC and SAL, using 
various values of l. Note that the generalizations produced by our solutions are 
guaranteed to be transparent l-diverse, whereas those by the other methods are 
not. 

Parameter                   Values 
l                           6, 7, 8, 9, 10 
Query dimensionality qd     2, 3, 4, 5 
Expected selectivity s      2%, 4%, 6%, 8%, 10% 

Table XIII. Parameters and Tested Values 

In accordance with [Ghinita et al. 2007; Wong et al. 2007; Xiao and Tao 2007], 
we evaluate the utility of a generalized table T* by using it to answer count queries 
about the underlying microdata T. Each query has the form: 

SELECT COUNT(*) FROM T 

WHERE pred(A^q_1) AND ... AND pred(A^q_d) AND pred(A^s) 

where pred(A) denotes a predicate on A. Predicates are generated based on two 
parameters: query dimensionality qd and expected selectivity s. Specifically, given 
qd ∈ [2, 5] and s ∈ (0, 1), we create a set S_A that contains the sensitive attribute A^s 
of T, and qd - 1 QI attributes randomly selected. Then, for each A ∈ S_A, we set 
pred(A) to "A ∈ I", where I is a random interval on A, enclosing a fraction s^(1/qd) 
of the values in A. Finally, for each A' ∉ S_A, pred(A') covers the whole domain 
of A'. By requiring qd ≥ 2 and A^s ∈ S_A, we aim to examine how well T* preserves 
the correlation between the QI and sensitive attributes. 
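
The predicate generation just described can be sketched as follows. This is a minimal illustration under assumptions rather than the generator used in the experiments: attribute values are taken to be the integers 0..size-1, the function name `make_query` and the key `"sensitive"` are hypothetical, and the interval width is rounded to the nearest integer (at least 1). Attributes absent from the returned dictionary are unconstrained.

```python
import random


def make_query(qi_domains, sensitive_size, qd, s, rng=None):
    """Return {attribute: (lo, hi)} with inclusive random intervals.

    qi_domains maps each QI attribute to its domain size; the sensitive
    attribute is always constrained, together with qd - 1 random QI attributes.
    """
    rng = rng or random.Random()
    chosen = rng.sample(sorted(qi_domains), qd - 1)
    sizes = {attr: qi_domains[attr] for attr in chosen}
    sizes["sensitive"] = sensitive_size

    fraction = s ** (1.0 / qd)          # each interval covers s^(1/qd) of a domain
    query = {}
    for attr, size in sizes.items():
        width = max(1, round(fraction * size))
        lo = rng.randint(0, size - width)
        query[attr] = (lo, lo + width - 1)
    return query


# Domain sizes of the QI attributes, taken from Table XII.
QI_DOMAINS = {"Age": 79, "Gender": 2, "Education": 17, "Birthplace": 57}
example_query = make_query(QI_DOMAINS, 50, qd=3, s=0.1, rng=random.Random(0))
```

With qd = 3 and s = 10%, each constrained attribute receives an interval covering roughly 0.1^(1/3), i.e., about 46%, of its domain, so the product of the three fractions is close to the expected selectivity.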

On each generalized table, we process several query workloads, each of which 
contains 1000 queries with identical qd and s. We gauge the utility of T* by the 
average workload error computed as follows. For each query, we derive its exact result 
act from T, and compute an estimated answer est from T* using the approximation 
technique in [LeFevre et al. 2006a]. The error of est is defined as 
|act - est| / max{act, δ}, where δ is set to 0.5% of the dataset cardinality. Then, 
the workload error equals the average error of all queries in the workload. Note 
that δ is introduced to prevent the workload error from being dominated by queries 
with exceedingly small results (similar approaches are adopted in [Garofalakis and 
Kumar 2005; Vitter and Wang 1999]). 
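
As a small worked sketch of this metric, assuming the per-query error is |act - est| / max{act, δ} with δ equal to 0.5% of the dataset cardinality, the workload error can be computed as below. The function names and the sample numbers are illustrative only and are not taken from the experiments.

```python
def query_error(act, est, delta):
    """Relative error of one estimate, with delta guarding against tiny results."""
    return abs(act - est) / max(act, delta)


def workload_error(results, cardinality, delta_fraction=0.005):
    """Average query error over a workload of (act, est) pairs."""
    delta = delta_fraction * cardinality
    return sum(query_error(act, est, delta) for act, est in results) / len(results)


# Two made-up queries on a 600k-tuple dataset, so delta = 3000.
sample = [(6000, 5400), (10, 40)]
sample_error = workload_error(sample, cardinality=600_000)
```

The first query contributes 600/6000 = 0.1. The second has an exact answer far below δ, so its error is 30/3000 = 0.01 instead of 30/10 = 3; this is the dampening effect that δ is meant to provide.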

Table XIII summarizes the experiment parameters. Unless otherwise specified, 
we always set the parameters to their default values, i.e., the bold numbers in 
Table XIII. All experiments are performed on a computer with a l.SGHz CPU and 
1GB memory. 

4.2 Utility of Generalization 

The first set of experiments evaluates the information loss incurred by each 
algorithm. Figure 11 illustrates the results as a function of l. As expected, the error 
of all methods escalates with l, since a larger l implies a more stringent anonymity 
requirement, which, in turn, demands more aggressive generalization. Hybrid and 
Mondrian have the best overall performance. This is strong evidence 
that the heuristics of Hybrid are highly effective. In particular, even though Hybrid 
must guarantee transparency, it still offers almost the same utility as 
Mondrian (which is non-transparent). 

Footnote: Mask requires two parameters k and l (k ≥ l) to generate an l-diverse table. We set k = l in our 
experiments, since a smaller k leads to a generalized table with higher utility, as shown in [Wong 
et al. 2007]. 

Fig. 11. Query Accuracy vs. l 

             Age    Gender   Education   Birthplace 
Occupation   0.49   0.62     0.28        0.38 
Income       0.71   0.87     0.41        0.50 

Table XIV. Correlation Ratio between Attributes 

Tailor and Ace exhibit worse performance than Hybrid. This is not surprising 
because, as mentioned in Section 3.3, Hybrid is designed to overcome the short- 
comings of Tailor and Ace. Mask incurs larger error than Hybrid in all cases, even 
though the former is vulnerable to adversaries with full algorithm knowledge (recall 
that Mask prevents only minimality attacks) . 

Each algorithm demonstrates similar behavior regardless of the dataset, except 
that Ace performs worse on SAL than on OCC. To explain this, we observe that 
incomes depend heavily on people's ages and education. Hence, when Ace 
employs Assign to create a partition U of SAL, each bucket in U contains tuples 
with very different QI values, due to the reason explained in Section 3.3. As a result, 
the QI-groups returned by Ace have long generalized intervals, rendering low data 
utility. The above phenomenon does not exist on OCC because occupation is much 
less correlated to the QI attributes. To support our analysis, Table XIV shows the 
correlation ratios [Kendall and Stuart 1979] between the QI and sensitive attributes 
of OCC and SAL. A larger ratio indicates stronger correlation. 

To study the influence of query dimensionality qd, Figure 12 plots the workload 
error as a function of qd. The relative performance of alternative algorithms 
remains the same as in Figure 11. In particular, Hybrid and Mondrian permit highly 
accurate counting analysis; their maximum error is less than 10%. Each algorithm 
has better query precision when the query dimensionality qd is higher. To 
understand this, recall that each query predicate either includes the whole domain of 
an attribute, or is an interval covering s^(1/qd) of the domain. When s is fixed but 
qd increases, s^(1/qd) becomes greater, implying wider query intervals, which lead to 
smaller error, as explained in [Xiao and Tao 2006a]. Figure 13 shows the error when 
the expected selectivity s grows from 2% to 10%. Again, the relative superiority 
of different algorithms is the same. Their error decreases when s increases, which is 
consistent with the experiment results in [Ghinita et al. 2007; Wong et al. 2007; 
Xiao and Tao 2007]. 

Fig. 12. Query Accuracy vs. Query Dimensionality 

Fig. 13. Query Accuracy vs. Expected Selectivity s 

In summary, Hybrid and Mondrian produce generalizations with similar data 
utility, and both significantly outperform Tailor, Ace, and Mask. Therefore, overall 
Hybrid is the best anonymization technique, since it promises a much stronger privacy 
guarantee than Mondrian. 

4.3 Computation Overhead 

Having examined the effectiveness of the proposed solutions, we proceed to evaluate 
their efficiency. In order to inspect their scalability with the dataset cardinality, 
based on OCC (SAL), we generate microdata tables with various cardinalities. 
Specifically, given a multiple n of 600k, a table with n tuples is synthesized by 
including n/600k copies of OCC (SAL). Figure 14 shows the generalization time of 
each method, as a function of n. The running time of Mask exhibits a superlinear 
increase with n, while the other algorithms scale almost linearly. Hybrid requires 
slightly higher overhead than Mondrian. This is not a serious disadvantage because 
(i) the difference is not large, (ii) the disadvantage is compensated by the 
transparency of Hybrid, and (iii) anonymization is an offline process, so it is reasonable 
to spend a little more time preparing a publication that safeguards privacy better. 

Fig. 14. Computation Time vs. Dataset Cardinality n 

Fig. 15. Computation Time vs. l (panels: (a) OCC, (b) SAL; l ranges from 6 to 10; 
legend: Tailor, Ace, Hybrid, Mask, Mondrian) 

Utilizing the 600k datasets, in Figure 15, we inspect the computation overhead 
as a function of l. The running time of Ace, Mondrian, and Mask is insensitive 
to l. In contrast, the processing cost of Tailor and Hybrid decreases rapidly as l 
grows. Recall that Tailor works by iteratively dividing QI-groups, until all 
QI-groups violate 2l-diversity. As l increases, fewer 2l-diverse QI-groups exist; hence 
Tailor terminates earlier. Hybrid has similar behavior because it deploys Tailor as 
the first step. 

In summary, Hybrid is ideal for practical applications because its computation 
cost enjoys linear scalability to the dataset cardinality. In particular, it anonymizes 
a dataset with nearly 10 million tuples within 5 minutes (see Figure 14). 

5. RELATED WORK 

The works closest to ours are due to Wong et al. [2007] and Zhang et al. [2007]. 
Since they have been discussed extensively in Sections 1 and 2.3, the following review 
concentrates on the rest of the literature on privacy preserving data publishing. 

A bulk of the literature focuses on designing privacy principles. The earliest 
principle, k-anonymity [Samarati 2001], requires that every QI-group should contain at 
least k tuples. Machanavajjhala et al. [2007] point out that a k-anonymous table 
may still incur privacy breach, unless each QI-group includes sufficiently diverse 
sensitive values. This observation leads to the concept of l-diversity, which has 
several instantiations, e.g., entropy l-diversity, recursive (c, l)-diversity, as discussed in 
Section 2.1. Besides k-anonymity and l-diversity, numerous other privacy principles 
[Byun et al. 2006; Chen et al. 2007; Kifer and Gehrke 2006; LeFevre et al. 2006b; Li 
et al. 2007; Martin et al. 2007; Nergiz et al. 2007; Wang and Fung 2006; Wong et al. 
2006; Xiao and Tao 2006b; 2007; Wong et al. 2007; Zhang et al. 2007; Zhang et al. 
2007] have been developed to offer different flavors of privacy protection, by 
placing various constraints on the contents of QI-groups. Our transparent l-diversity 
principle distinguishes itself from all the previous principles, in that it guarantees 
privacy even when the anonymization process is public knowledge. 

Generalization algorithms are another well-explored topic [Aggarwal et al. 2006; 
Bayardo and Agrawal 2005; Fung et al. 2005; Ghinita et al. 2007; LeFevre et al. 
2005; 2006a; 2006b; Iyengar 2002; Wang et al. 2004; Wong et al. 2006; Xiao and 
Tao 2007; Xu et al. 2006; Wong et al. 2007; Zhang et al. 2007]. These solutions 
aim at minimizing the information loss, according to different anonymization 
constraints (e.g., global/local recoding) and measurements of loss (e.g., discernibility). 
Many of them are initially devised for k-anonymity, but can be modified to 
support l-diversity and other principles, as explained in [Machanavajjhala et al. 2007]. 
However, except the algorithms proposed in [Xiao and Tao 2007; Zhang et al. 2007], 
none of these algorithms is transparent. In other words, they can no longer ensure 
the privacy guarantee of the underlying principle, when an adversary is aware of 
the details of the algorithm. 

Other problems related to generalization have also attracted considerable 
research efforts. Specifically, optimal k-anonymous generalization has been shown 
to be NP-hard in [Aggarwal et al. 2005; Meyerson and Williams 2004; Park and 
Shim 2007], which also develop approximation algorithms with provable worst-case 
quality guarantees. Aggarwal [2005] shows that when the number of QI attributes 
is large, it is simply impossible to achieve k-anonymity without substantial 
information loss (even when k is small). Xiao and Tao [2006a] develop anatomy as an 
alternative anonymization technique that achieves higher data utility than 
generalization does. 

In addition, there exist several anonymization techniques [Agrawal et al. 2005; 
Dwork et al. 2006; Evfimievski et al. 2003; Machanavajjhala et al. 2008; Tao et al. 
2008] that do not adopt generalization. Instead, they anonymize microdata by 
adding random "noise" into the data, i.e., by replacing a fraction of tuples in the 
microdata with randomly generated tuples [Agrawal et al. 2005; Evfimievski et al. 
2003; Tao et al. 2008], or by deriving the tuple distribution in the microdata and 
then publishing a noisy version of the distribution [Dwork et al. 2006; Machanava- 
jjhala et al. 2008]. These techniques are designed by assuming that the process for 
generating random "noise" is known to the public, and hence, they do not suffer 
from reverse engineering attacks. 

6. CONCLUSIONS 

Most existing anonymization techniques fail to protect privacy against adversaries 
with full knowledge of the anonymization mechanism. In this paper, we remedy the 
problem with two important contributions. First, we provide a thorough analysis 
of the disclosure risks in the anonymized tables, assuming that the anonymization 
algorithm is public knowledge. This analysis leads to the formulation of 
transparent l-diversity, which ensures small disclosure risks in an anonymized table, even 
if everything involved in the anonymization process, except the microdata, is 
revealed to the public. Second, we identify three anonymization algorithms that can 
enforce transparent l-diversity, and demonstrate their practical usefulness through 
extensive experiments. 

This work also lays down a solid foundation for future research. First, our 
analysis focuses on l-diversity due to its popularity in the literature. However, the 
concept of transparent anonymization is general, and can be integrated with any 
other principle (e.g., t-closeness [Li et al. 2007], δ-presence [Nergiz et al. 2007]). 
It is an interesting direction to design transparent generalization algorithms for 
those principles. Second, the proposed solutions are heuristic in nature, and do not 
have attractive asymptotic performance guarantees. It is a challenging problem to 
study theoretical transparent algorithms. Note that the existing findings (including 
the complexity results, approximation algorithms, etc.) were derived for 
conventional generalization, and hence, are not immediately applicable to transparent 
anonymization. 



APPENDIX 

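Proof of Proposition 1. Observe that the adversary's knowledge about the 
external source E can be expressed as T ∈ S, since S consists of all microdata 
tables that involve the individuals in E. Furthermore, if T ∈ S_{o,v}, then o has a 
sensitive value v in T, and vice versa. Hence, 

\[
\begin{aligned}
risk(o) &= \max_{v \in A^s} \Pr\{o \text{ has } v \text{ in } T \mid E \wedge \mathcal{G} \wedge T^* \wedge l\} \\
&= \max_{v \in A^s} \Pr\{T \in S_{o,v} \mid T \in S \wedge \mathcal{G} \wedge T^* \wedge l\} \\
&= \max_{v \in A^s} \Pr\{T \in S_{o,v} \mid T \in S \wedge \mathcal{G}(T,l) = T^*\} \\
&= \max_{v \in A^s} \frac{\Pr\{T \in S_{o,v} \wedge T \in S \wedge \mathcal{G}(T,l) = T^*\}}{\Pr\{T \in S \wedge \mathcal{G}(T,l) = T^*\}} \\
&= \max_{v \in A^s} \frac{\Pr\{T \in S_{o,v} \wedge \mathcal{G}(T,l) = T^*\}}{\Pr\{T \in S \wedge \mathcal{G}(T,l) = T^*\}} \quad (\text{since } S_{o,v} \subseteq S) \\
&= \max_{v \in A^s} \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{T = \hat{T}\} \cdot \Pr\{\mathcal{G}(\hat{T},l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{T = \hat{T}\} \cdot \Pr\{\mathcal{G}(\hat{T},l) = T^*\}}
\end{aligned}
\]

Recall that each possible microdata instance in S is equally likely for the adversary, 
before s/he observes T*. That is, for any \hat{T}_1, \hat{T}_2 ∈ S, we have 
Pr{T = \hat{T}_1} = Pr{T = \hat{T}_2}. Thus, 

\[
\begin{aligned}
risk(o) &= \max_{v \in A^s} \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{T = \hat{T}\} \cdot \Pr\{\mathcal{G}(\hat{T},l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{T = \hat{T}\} \cdot \Pr\{\mathcal{G}(\hat{T},l) = T^*\}} \\
&= \max_{v \in A^s} \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{\mathcal{G}(\hat{T},l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{\mathcal{G}(\hat{T},l) = T^*\}},
\end{aligned}
\]

which completes the proof. □ 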
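
The quantity established by this proof, namely the largest share (over sensitive values v) of the total probability of outputting T* that is contributed by the candidate tables in which o has value v, can be evaluated on a toy instance. The sketch below is purely illustrative: the function name `risk`, the encoding of a candidate table as a dictionary from individual to sensitive value, and the probabilities are all assumptions, and the candidate set is taken to have non-zero total probability of producing T*.

```python
from collections import defaultdict
from fractions import Fraction


def risk(candidates, individual):
    """candidates: list of (table, p), where table maps individual -> value and
    p is the probability that the algorithm outputs T* when run on that table."""
    total = sum(p for _, p in candidates)
    by_value = defaultdict(Fraction)       # probability mass per sensitive value
    for table, p in candidates:
        by_value[table[individual]] += p
    return max(by_value.values()) / total


T_a = {"Ann": "flu", "Bob": "dyspepsia"}
T_b = {"Ann": "dyspepsia", "Bob": "flu"}

# Both candidates are equally likely to yield T*: the risk is 1/2.
balanced = risk([(T_a, Fraction(1, 2)), (T_b, Fraction(1, 2))], "Ann")

# Only T_a can yield T*: an adversary who knows the algorithm infers Ann's value.
skewed = risk([(T_a, Fraction(1)), (T_b, Fraction(0))], "Ann")
```

The second case shows why knowledge of the algorithm matters: if the algorithm can produce T* from only one of the candidate tables, the risk rises to 1 even though the published table looks 2-diverse.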

Proof of Lemma 1. Assume by contradiction that Opt-Gen is not a minimal 
algorithm. Then, there exists a microdata table T and a positive integer l such 
that T1* = Opt-Gen(T, l) is not a minimal l-diverse generalization of T, with respect 
to the MBR function f and the global recoding scheme. Let P1 be the partition of 
T that decides T1*. By Definition 7, there should be a child P2 of P1, such that P2 
and f decide a generalization T2* that conforms to the global recoding scheme. 

According to Definition 6, (i) there exists a unique QI-group G1 in P1 that does 
not appear in P2, and (ii) P2 contains only two QI-groups G2 and G3 that are 
not included in P1. Furthermore, since G1 = G2 ∪ G3 and G2 ∩ G3 = ∅, we have 
|G1| = |G2| + |G3|. Thus, 

\[
\begin{aligned}
\sum_{G \in P_1} |G|^2 &= |G_1|^2 + \sum_{G \in P_1 - \{G_1\}} |G|^2 \\
&> |G_2|^2 + |G_3|^2 + \sum_{G \in P_1 - \{G_1\}} |G|^2 \\
&= \sum_{G \in P_1 - \{G_1\} + \{G_2, G_3\}} |G|^2 \\
&= \sum_{G \in P_2} |G|^2,
\end{aligned}
\]

which contradicts the fact that Opt-Gen minimizes the discernability of the 
generalized tables. Hence, the lemma is proved. □ 
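
The inequality at the heart of this proof is easy to check numerically. The sketch below assumes that discernability is measured as the sum of squared QI-group sizes, as in the derivation above; the function name and the toy partitions are illustrative.

```python
def discernability(partition):
    """Sum of |G|^2 over the QI-groups of a partition."""
    return sum(len(group) ** 2 for group in partition)


# P1 contains a group G1 of four tuples; its child P2 splits G1 into G2 and G3.
P1 = [{"Ann", "Bob", "Gate", "Don"}, {"Ed", "Fred"}]
P2 = [{"Ann", "Gate"}, {"Bob", "Don"}, {"Ed", "Fred"}]
```

Here `discernability(P1)` is 16 + 4 = 20, whereas `discernability(P2)` is 4 + 4 + 4 = 12, so splitting a group strictly lowers the metric whenever both parts are non-empty.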

Proof of Lemma 2. Let G3' (G4') be the set of tuples in G', such that G3' and G1 
(G4' and G2) involve the same set of individuals. To prove the lemma, it suffices to 
show that {G3', G4'} is the canonical l-cut of G'. 

Without loss of generality, assume that {G1, G2} is an l-cut of G on A^q_i (i ∈ [1, d]). 
We will first prove that {G3', G4'} is an l-cut of G' on A^q_i, i.e., {G3', G4'} satisfies the 
three conditions in Definition 9. Observe that the first condition trivially holds. 
Let v be the most frequent A^s value in G, and c be the number of tuples in G with 
a sensitive value v. Since G and G' are isomorphic, they contain the same multi-set 
of A^s values. Therefore, c is also the maximum number of tuples in G' with an 
identical sensitive value. Since |G3'| = |G1| ≥ c · l and |G4'| = |G2| ≥ c · l, {G3', G4'} 
fulfills the second condition in Definition 9. 

Assume by contradiction that {G3', G4'} violates the third condition in 
Definition 9. There should exist t3' ∈ G3' and t4' ∈ G4', such that (i) t3'[A^q_i] > t4'[A^q_i], or 
(ii) t3'[A^q_i] = t4'[A^q_i] and t3'[A^id] > t4'[A^id]. Let t1 (t2) be the tuple in G1 (G2), such 
that t1 and t3' (t2 and t4') concern the same individual. Then, t1 and t3' (t2 and t4') 
should have the same QI values. As a result, we have either (i) t1[A^q_i] > t2[A^q_i], 
or (ii) t1[A^q_i] = t2[A^q_i] and t1[A^id] > t2[A^id]. This contradicts the assumption that 
{G1, G2} is an l-cut of G. Therefore, {G3', G4'} is an l-cut of G' on A^q_i. 

Next, we will show that $\{G'_3, G'_4\}$ is canonical. Assume that this is not true.
Then, by Definition 10, at least one of the following three conditions must hold:

(1) Among the $l$-cuts of $G'$, the perimeter of $\{G'_3, G'_4\}$ is not the smallest.

(2) There exists an $l$-cut $\{G'_5, G'_6\}$ of $G'$ on $A_j^q$ ($j < i$), such that $hp(G'_5) + hp(G'_6) =
hp(G'_3) + hp(G'_4)$.

(3) There exists an $l$-cut $\{G'_5, G'_6\}$ of $G'$ on $A_i^q$, such that $|G'_5| < |G'_3|$, and $hp(G'_5) +
hp(G'_6) = hp(G'_3) + hp(G'_4)$.

Consider that Condition 3 is satisfied. Let $G_5$ ($G_6$) be the set of tuples in $G$, such
that $G_5$ and $G'_5$ ($G_6$ and $G'_6$) contain the same set of individuals. It can be verified
that $\{G_5, G_6\}$ is an $l$-cut of $G$ on $A_i^q$, and $hp(G_5) + hp(G_6) = hp(G'_5) + hp(G'_6)$.
Then,

\[ hp(G_5) + hp(G_6) = hp(G'_5) + hp(G'_6) = hp(G'_3) + hp(G'_4) = hp(G_1) + hp(G_2). \]

Furthermore, $|G_5| = |G'_5| < |G'_3| = |G_1|$. This contradicts the assumption that
$\{G_1, G_2\}$ is the canonical $l$-cut of $G$.

Similarly, it can be shown that when Condition 1 or 2 holds, $\{G_1, G_2\}$ cannot be
the canonical $l$-cut of $G$, leading to a contradiction. Thus, $\{G'_3, G'_4\}$ should be the
canonical $l$-cut of $G'$, which completes the proof. □

Proof of Lemma 3. Let $T_2^* = Tailor(T_2, l)$, and $P_3$ be the partition of $T_2$ that
decides $T_2^*$. We will prove the lemma by showing that (i) $P_1$ and $P_3$ are isomorphic,
and (ii) $P_2 = P_3$. The former guarantees that $T_2^* = T^*$, since isomorphic partitions
always lead to the same anonymization.

To facilitate our proof, we construct a binary tree $R_1$ of QI-groups as follows.
First, we set the root of $R_1$ to $T_1$. Then, we apply Tailor on $T_1$ with the given $l$
value, and monitor the execution of Tailor. As shown in Figure 3, Tailor will first
construct a partition $P = \{G_0\}$, with $G_0 = T_1$. Then, each time Tailor computes
the canonical $l$-cut $\{G_1, G_2\}$ of a QI-group $G \in P$, we insert $G_1$ and $G_2$ (into $R_1$) as
the child nodes of $G$. As such, after Tailor terminates, each leaf of $R_1$ is a QI-group
in $P_1$, and vice versa. We refer to $R_1$ as the split history of $T_1$. Following the same
methodology, we also construct the split history $R_2$ of $T_2$, such that the leaves of
$R_2$ constitute $P_3$.
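The notion of a split history can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: `toy_cut` is a hypothetical stand-in for the canonical $l$-cut (it splits at the median of a single QI attribute, breaking ties by name, and is not Definition 10), and the eight-row table is illustrative data. What it shows is the bookkeeping described above: every cut adds two child nodes, and the leaves of the resulting tree form the final partition.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple

Row = Tuple[int, str, str]  # (age, name, disease): one QI attribute, identifier, sensitive value


@dataclass
class Node:
    group: List[Row]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def toy_cut(group: List[Row], l: int):
    """Hypothetical stand-in for the canonical l-cut (NOT Definition 10).

    Orders tuples by (age, name), splits at the median, and accepts the split
    only if no sensitive value exceeds a 1/l share of either half.
    """
    ordered = sorted(group, key=lambda row: (row[0], row[1]))
    mid = len(ordered) // 2
    halves = (ordered[:mid], ordered[mid:])
    for half in halves:
        if not half:
            return None
        most_common = max(Counter(row[2] for row in half).values())
        if most_common * l > len(half):
            return None
    return halves


def split_history(group: List[Row], l: int, cut=toy_cut) -> Node:
    """Record every cut as a pair of child nodes; groups that are not cut become leaves."""
    node = Node(group)
    parts = cut(group, l)
    if parts is not None:
        node.left = split_history(parts[0], l, cut)
        node.right = split_history(parts[1], l, cut)
    return node


def leaves(node: Node) -> List[List[Row]]:
    """Collect the leaf groups from left to right; together they form the partition."""
    if node.left is None:
        return [node.group]
    return leaves(node.left) + leaves(node.right)


table = [
    (21, "Ann", "flu"), (27, "Bob", "dyspepsia"),
    (32, "Cate", "gastritis"), (32, "Don", "bronchitis"),
    (54, "Ed", "flu"), (60, "Fred", "dyspepsia"),
    (60, "Gill", "diabetes"), (60, "Hera", "gastritis"),
]
root = split_history(table, l=2)
partition = leaves(root)
print([[name for _, name, _ in group] for group in partition])
```

Because the sketch's cut depends only on QI values, identifiers, and the multi-set of sensitive values in each half, running it on an isomorphic table would produce a tree of the same shape, which is the property the induction below relies on for the real algorithm.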

Next, we will prove that $P_1$ is isomorphic to $P_3$, by showing that each leaf of $R_1$
is isomorphic to a leaf of $R_2$, and vice versa. Our proof is by induction. For the
base case, let us consider the roots of $R_1$ and $R_2$. Let $G_1$ ($G_2$) denote the root of $R_1$
($R_2$). We have $G_1 = T_1$ and $G_2 = T_2$. Since $P_1$ and $P_2$ are isomorphic, $T_1$ and $T_2$
should also be isomorphic, because $T_1 = \bigcup_{G \in P_1} G$ and $T_2 = \bigcup_{G \in P_2} G$. Therefore,
$G_1$ is isomorphic to $G_2$.

As a second step, assume that two nodes $G_3 \in R_1$ and $G_4 \in R_2$ are isomorphic.
We will establish two propositions:

- Proposition 1. $G_3$ is a leaf of $R_1$ if and only if $G_4$ is a leaf of $R_2$.
- Proposition 2. If $G_3$ is not a leaf, then each child of $G_3$ is isomorphic to a child
of $G_4$.

Observe that $G_3$ ($G_4$) is a leaf of $R_1$ ($R_2$) if and only if it is not $2l$-diverse;
otherwise, it would have been divided into smaller parts by Tailor. Since $G_3$ and $G_4$
are isomorphic, if $G_3$ is not $2l$-diverse, $G_4$ must violate $2l$-diversity, and vice versa.
Therefore, Proposition 1 holds.

Now assume that $G_3$ is not a leaf. Let $\{G_a, G_b\}$ and $\{G'_a, G'_b\}$ be the canonical
$l$-cuts of $G_3$ and $G_4$, respectively. By Lemma 2, $G_a$ and $G'_a$ ($G_b$ and $G'_b$) contain
the same set of individuals. We will show that $G_a$ ($G_b$) is isomorphic to $G'_a$ ($G'_b$).

Consider the set $S_a$ of leaves under the subtree of $G_a$. We have $\bigcup_{G \in S_a} G = G_a$.
Since $P_1$ and $P_2$ are isomorphic, there exists a subset $S'_a$ of $P_2$, such that each
$G \in S_a$ is isomorphic to some $G' \in S'_a$, and vice versa. Let $G_5 = \bigcup_{G' \in S'_a} G'$. Then,
$G_5$ is isomorphic to $G_a$, which indicates that $G_5$ and $G_a$ involve the same set of
individuals. Recall that $G_a$ and $G'_a$ also contain an identical set of individuals.
Hence, each individual in $G'_a$ appears in $G_5$, and vice versa. Because both $G'_a$ and
$G_5$ are subsets of $T_2$, we have $G'_a = G_5$. Consequently, $G'_a$ is isomorphic to $G_a$.
Similarly, it can be verified that $G_b$ and $G'_b$ are isomorphic. Thus, Proposition
2 is valid. By induction, each leaf of $R_1$ is isomorphic to a leaf of $R_2$, and vice
versa. Hence, $P_1$ is isomorphic to $P_3$.

To complete the proof, it remains to show that $P_2 = P_3$. Since both $P_2$ and $P_3$
are isomorphic to $P_1$, $P_2$ must be isomorphic to $P_3$. Therefore, for each QI-group
$G \in P_2$, there exists $G' \in P_3$, such that $G$ and $G'$ involve the same set of individuals.
This indicates that $G = G'$, since both $G$ and $G'$ are subsets of $T_2$. Therefore,
$P_2 = P_3$, which proves the lemma. □

Proof of Theorem 1. Let $T$ be any microdata table, $l$ be any positive integer,
and $T^* = \mathcal{G}(T, l)$. Let $E$ be any external source, and $C$ be the set of possible
microdata instances based on $E$, such that $\mathcal{G}(T, l) = T^*$ for any $T \in C$. Let $o$ be
any individual, $v$ be an arbitrary sensitive value, and $C'$ the subset of $C$, such that
each $T \in C'$ contains a tuple $t$ with $t[A^{id}] = o$ and $t[A^s] = v$. By Proposition 1, we
can prove Theorem 1 by showing that

\[ \frac{|C'|}{|C|} \le \frac{1}{l}. \tag{5} \]

For each $T \in C$, we define the essential partition of $T$ as the partition of $T$
generated by $\mathcal{G}$ when taking $T$ and $l$ as input. We divide $C$ into disjoint clusters,
such that each cluster is a maximal set of instances (in $C$) whose essential partitions
are isomorphic. Let $n$ be the total number of clusters in $C$, and $C_j$ ($j \in [1, n]$) the
$j$-th cluster. Let $C'_j$ be a set containing the instances in $C_j$ that associate $o$ with
$v$. In the following, we will show that $|C'_j|/|C_j| \le 1/l$ for any $j \in [1, n]$, which will
prove the theorem, as it leads to

\[ \frac{|C'|}{|C|} = \frac{\sum_{j=1}^{n} |C'_j|}{\sum_{j=1}^{n} |C_j|} \le \frac{\sum_{j=1}^{n} |C_j|/l}{\sum_{j=1}^{n} |C_j|} = \frac{1}{l}. \tag{6} \]

Consider any $T \in C_j$ for some $j \in [1, n]$. Let $P$ be the essential partition of $T$,
$m = |P|$, and $G_k$ ($k \in [1, m]$) the $k$-th QI-group in $P$. Let $P'$ be a partition isomorphic
to $P$, and $T' = \bigcup_{G' \in P'} G'$. Since $T'$ and $T$ involve the same set of individuals,
$T'$ is a possible microdata instance based on $E$. By the assumption on $\mathcal{G}$, we have
$\mathcal{G}(T', l) = T^*$. Therefore, $T' \in C_j$. In other words, for any partition $P'$ isomorphic
to $P$, the microdata corresponding to $P'$ is contained in $C_j$. Then, by the definition
of $C_j$, $|C_j|$ should equal the total number of distinct partitions isomorphic to $P$,
including $P$ itself. According to the definition of partition isomorphism, we can
obtain any partition isomorphic to $P$ by replacing any QI-groups in $P$ with their
isomorphic counterparts. Let $a_k$ be the number of distinct QI-groups isomorphic
to $G_k$. Then, the total number of partitions isomorphic to $P$ should be $\prod_{k=1}^{m} a_k$.
That is, $|C_j| = \prod_{k=1}^{m} a_k$.

Next, we will derive the value of $|C'_j|$. Without loss of generality, assume that $o$
appears in the first QI-group $G_1$ of $P$. Among the QI-groups isomorphic to $G_1$, let
$a'_1$ be the number of QI-groups that associate $o$ with the sensitive value $v$. Then, we
have $|C'_j| = a'_1 \cdot \prod_{k=2}^{m} a_k$. Therefore, $|C'_j|/|C_j| = a'_1/a_1$.

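If $v$ does not appear in $G_1$, then $a'_1 = 0$. Otherwise, assume that $G_1$ contains $x$
sensitive values $v_1, v_2, \ldots, v_x$, such that $v_1 = v$. Further assume that there exist $b_i$
($i \in [1, x]$) tuples in $G_1$ with sensitive value $v_i$. Then, there are $|G_1|! / \prod_{i=1}^{x} (b_i!)$ different
combinations between the sensitive values and the individuals in $G_1$. Since each
combination corresponds to a QI-group isomorphic to $G_1$, we have

\[ a_1 = \frac{|G_1|!}{\prod_{i=1}^{x} (b_i!)}. \tag{7} \]

Observe that, among the $a_1$ combinations, there exist $(|G_1| - 1)! / \big((b_1 - 1)! \cdot \prod_{i=2}^{x} (b_i!)\big)$
combinations that assign the sensitive value $v_1$ to $o$. Therefore,

\[ a'_1 = \frac{(|G_1| - 1)!}{(b_1 - 1)! \cdot \prod_{i=2}^{x} (b_i!)}. \tag{8} \]

Hence, we have

\[ \frac{|C'_j|}{|C_j|} = \frac{a'_1}{a_1} = \frac{(|G_1| - 1)! \,/\, \big((b_1 - 1)! \cdot \prod_{i=2}^{x} (b_i!)\big)}{(|G_1|!) \,/\, \prod_{i=1}^{x} (b_i!)} = \frac{b_1}{|G_1|}. \tag{9} \]

Since $G_1$ is $l$-diverse, we have $b_1/|G_1| \le 1/l$. Consequently, $|C'_j|/|C_j| \le 1/l$, which
completes the proof. □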
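The counting argument above can be checked by brute force on a small example. The sketch below enumerates every distinct way of assigning a fixed multi-set of sensitive values to the individuals of one QI-group (each assignment is one isomorphic QI-group), counts those that give a chosen individual a chosen value, and compares the result with the closed forms $|G_1|!/\prod_i b_i!$ and $(|G_1|-1)!/((b_1-1)!\prod_{i \ge 2} b_i!)$. The group contents and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations
from math import factorial, prod


def count_by_enumeration(individuals, values, target, v):
    """Brute-force a_1 (all isomorphic groups) and a'_1 (those giving `target` value `v`)."""
    distinct = set(permutations(values))  # distinct arrangements of the multi-set
    idx = individuals.index(target)
    a1 = len(distinct)
    a1_prime = sum(1 for assignment in distinct if assignment[idx] == v)
    return a1, a1_prime


def count_by_formula(values, v):
    """Closed-form a_1 and a'_1 as derived in the proof."""
    counts = Counter(values)
    size = len(values)
    a1 = factorial(size) // prod(factorial(b) for b in counts.values())
    if counts[v] == 0:
        return a1, 0
    rest = prod(factorial(b) for value, b in counts.items() if value != v)
    a1_prime = factorial(size - 1) // (factorial(counts[v] - 1) * rest)
    return a1, a1_prime


# Hypothetical QI-group: four individuals, flu appears twice (b_1 = 2).
people = ["Ann", "Bob", "Cate", "Don"]
diseases = ["flu", "flu", "dyspepsia", "gastritis"]

print(count_by_enumeration(people, diseases, "Ann", "flu"))  # prints: (12, 6)
print(count_by_formula(diseases, "flu"))                     # prints: (12, 6)
```

Here $a'_1/a_1 = 6/12 = b_1/|G_1| = 2/4$, matching the ratio derived in the proof.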

Proof of Lemma 4. Given any two sets $S_t$ and $S'_t$ of tuples, we say that they are
cousins if $S_t$ and $S'_t$ involve the same set of individuals. To prove the lemma, we
first establish the following proposition:

- Proposition 3. Let $B$ and $B'$ be two symmetric buckets, and $\{B_a, B_b\}$ be a
division of $B$ on $A_i^q$ ($i \in [1, d]$). Let $B'_a$ ($B'_b$) be the subset of $B'$, such that
$B_a$ and $B'_a$ ($B_b$ and $B'_b$) are cousins. Then, $\{B'_a, B'_b\}$ is a division of $B'$ on $A_i^q$.
Furthermore, $B'_a$ and $B_a$ ($B'_b$ and $B_b$) are symmetric.

Let $V$ be the signature of $B$, and $x = |V|$. Since $B$ and $B'$ are symmetric, $V$
should also be the signature of $B'$. Because $B'_a$ and $B'_b$ are subsets of $B'$, their
signatures should be subsets of $V$. In the following, we will first show that $B'_a$ is a
bucket with signature $V$. Assume that this is not true. Then, there must exist a
column $L'_1$ of $B'_a$, such that $|L'_1| > |B'_a|/x$. Let $L_1$ be the subset of $B_a$, such that $L_1$
and $L'_1$ are cousins. Because $B$ and $B'$ are symmetric, if any two individuals have
the same sensitive value in $B'$, they should also have an identical $A^s$ value in $B$.
This indicates that all tuples in $L_1$ share the same $A^s$ value. Since $|L_1| > |B_a|/x$,
$B_a$ should have a column with more than $|B_a|/x$ tuples. This contradicts the
assumption that $B_a$ is a bucket. Therefore, $B'_a$ must be a bucket with signature
$V$. By the same reasoning, it can be proved that $B'_b$ is also a bucket with signature
$V$.

Assume by contradiction that $\{B'_a, B'_b\}$ is not a division of $B'$ on $A_i^q$. Then,
by Definition 11, there must exist two tuples $t'_a \in B'_a$ and $t'_b \in B'_b$, such that (i)
$t'_a[A_i^q] > t'_b[A_i^q]$, or (ii) $t'_a[A_i^q] = t'_b[A_i^q]$ and $t'_a[A^{id}] > t'_b[A^{id}]$. Let $t_a$ and $t_b$ be the
tuples in $B$, such that $t_a$ and $t'_a$ ($t_b$ and $t'_b$) concern the same individual. Then, we
have either (i) $t_a[A_i^q] > t_b[A_i^q]$, or (ii) $t_a[A_i^q] = t_b[A_i^q]$ and $t_a[A^{id}] > t_b[A^{id}]$. In that
case, $\{B_a, B_b\}$ is not a division of $B$, leading to a contradiction. Hence, $\{B'_a, B'_b\}$
must be a division of $B'$.

To prove Proposition 3, it remains to show that $B'_a$ and $B_a$ ($B'_b$ and $B_b$) are
symmetric. Consider any column $L_2 \in B_a$. We have $|L_2| = |B_a|/x = |B'_a|/x$.
Let $L'_2$ be the cousin of $L_2$ in $B'_a$. Then, $|L'_2| = |L_2| = |B'_a|/x$. Since $B$ and $B'$
are symmetric, any individuals with the same sensitive value in $B$ should
also share an identical $A^s$ value in $B'$. Therefore, all tuples in $L'_2$ have the same
sensitive value. Observe that each column in $B'_a$ should contain exactly $|B'_a|/x$
tuples, which indicates that $L'_2$ is a column in $B'_a$. In summary, for any column $L_2$
in $B_a$, there exists a column $L'_2$ in $B'_a$, such that $L_2$ and $L'_2$ involve an identical set
of individuals. Hence, $B'_a$ is symmetric to $B_a$. Similarly, it can be shown that $B'_b$
and $B_b$ are symmetric. Thus, Proposition 3 holds.
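The properties used in this argument can be made concrete with a short Python sketch. The formal definitions of buckets and symmetry are given elsewhere in the paper, so the sketch only encodes what the proof itself relies on, as an assumption: a bucket with a signature of $x$ values has $x$ columns (one per sensitive value) of equal size, and two buckets are symmetric if they share a signature and their columns cover identical sets of individuals. The function names and sample data are hypothetical.

```python
from collections import defaultdict


def columns(bucket):
    """Group individuals by sensitive value; `bucket` maps individual -> sensitive value."""
    cols = defaultdict(set)
    for person, value in bucket.items():
        cols[value].add(person)
    return cols


def is_bucket(bucket):
    """Assumed property: every column holds exactly |B|/x tuples, i.e. all columns are equal in size."""
    sizes = {len(col) for col in columns(bucket).values()}
    return len(sizes) == 1


def are_symmetric(b1, b2):
    """Assumed property: same signature, and each column of b1 has a cousin column in b2."""
    same_signature = set(b1.values()) == set(b2.values())
    cols1 = {frozenset(col) for col in columns(b1).values()}
    cols2 = {frozenset(col) for col in columns(b2).values()}
    return same_signature and cols1 == cols2


b = {"Ann": "flu", "Bob": "flu", "Cate": "dyspepsia", "Don": "dyspepsia"}
# Swapping the sensitive values of the two columns keeps the column memberships.
b_swapped = {"Ann": "dyspepsia", "Bob": "dyspepsia", "Cate": "flu", "Don": "flu"}
# Regrouping individuals changes the columns, so this bucket is not symmetric to b.
b_regrouped = {"Ann": "flu", "Cate": "flu", "Bob": "dyspepsia", "Don": "dyspepsia"}

print(is_bucket(b), are_symmetric(b, b_swapped), are_symmetric(b, b_regrouped))
# prints: True True False
```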

Now we are ready to prove the lemma. Without loss of generality, assume that
$\{B_1, B_2\}$ is a division of $B$ on $A_i^q$ ($i \in [1, d]$). Let $B'_3$ ($B'_4$) be the subset of $B'$, such
that $B'_3$ and $B_1$ ($B'_4$ and $B_2$) are cousins. By Proposition 3, $\{B'_3, B'_4\}$ is a division of
$B'$ on $A_i^q$, and $B'_3$ ($B'_4$) is symmetric to $B_1$ ($B_2$). To establish the lemma, it suffices
to show that $\{B'_3, B'_4\}$ is the canonical division of $B'$. Assume, on the contrary,
that $\{B'_3, B'_4\}$ is not canonical. Then, by Definition 12, $\{B'_3, B'_4\}$ should satisfy at
least one of the following three conditions:

(1) $\{B'_3, B'_4\}$ is not a division of $B'$ with the smallest perimeter.

(2) There exists a division $\{B'_5, B'_6\}$ of $B'$ on $A_j^q$ ($j < i$), such that $hp(B'_5) +
hp(B'_6) = hp(B'_3) + hp(B'_4)$.

(3) There exists a division $\{B'_5, B'_6\}$ of $B'$ on $A_i^q$, such that $hp(B'_5) + hp(B'_6) =
hp(B'_3) + hp(B'_4)$ and $|B'_5| < |B'_3|$.

Assume that $\{B'_3, B'_4\}$ fulfills Condition 3. Let $B_5$ ($B_6$) be the subset of $B$, such that
$B_5$ and $B'_5$ ($B_6$ and $B'_6$) are cousins. By Proposition 3, $\{B_5, B_6\}$ is a division of $B$
on $A_i^q$. Then, $|B_5| = |B'_5| < |B'_3| = |B_1|$. Since each individual has the same QI
values in $B$ and $B'$,

\[ hp(B_5) + hp(B_6) = hp(B'_5) + hp(B'_6) = hp(B'_3) + hp(B'_4) = hp(B_1) + hp(B_2). \]

In that case, $\{B_1, B_2\}$ cannot be the canonical division of $B$ (due to the existence of
$\{B_5, B_6\}$), leading to a contradiction. Therefore, $\{B'_3, B'_4\}$ must violate Condition
3. Similarly, it can be verified that $\{B'_3, B'_4\}$ must also violate Conditions 1 and 2,
i.e., $\{B'_3, B'_4\}$ should be the canonical division of $B'$. Thus, the lemma is proved. □

Proof of Lemma 5. Consider that we apply Slice on $U_1$, with the given $l$ value.
As shown in Figure 8, Slice will iteratively retrieve a bucket $B \in U_1$, compute the
canonical division $\{B_a, B_b\}$ of $B$, and then replace $B$ with $B_a$ and $B_b$. This process
is carried on until the bucket partition $U'_1$ is obtained. Let $Q_1$ be the union of the
canonical divisions computed by Slice in each iteration, and $Q'_1 = Q_1 \cup U_1$. We
organize the buckets in $Q'_1$ into $|U_1|$ binary trees as follows:

(1) For the $i$-th ($i \in [1, |U_1|]$) binary tree $R_i$, the root of $R_i$ is the $i$-th bucket $B_i$
in $U_1$.

(2) For any three buckets $B_1, B_2, B_3 \in Q'_1$, $B_2$ and $B_3$ are the child nodes of $B_1$ if
and only if $\{B_2, B_3\}$ is a division of $B_1$.

We refer to $R_i$ as the split history of $B_i$. Notice that $U'_1$ equals the union of the
leaves of each $R_i$ ($i \in [1, |U_1|]$). Next, assume that we apply Slice on $U_2$. Let $B'_i$
denote the bucket in $U_2$ that is symmetric to $B_i$ ($i \in [1, |U_1|]$). Following the way
$R_i$ is generated, we also construct the split history $R'_i$ of $B'_i$. Then, the leaves of
all $R'_i$ ($i \in [1, |U_1|]$) constitute $U'_2$. To prove the lemma, it suffices to show that, for
any $i \in [1, |U_1|]$, each leaf of $R_i$ is symmetric to a leaf of $R'_i$, and vice versa.

Our proof is by induction. For the base case, the root $B_i$ of $R_i$ is symmetric to
the root $B'_i$ of $R'_i$. Next, assume that two nodes $B \in R_i$ and $B' \in R'_i$ are symmetric.
We will show that (i) $B$ is a leaf of $R_i$ if and only if $B'$ is a leaf of $R'_i$; (ii) if $B$ is
not a leaf, then each child node of $B$ is symmetric to a child node of $B'$.

As shown in Figure 8, a bucket in $R_i$ or $R'_i$ is a leaf if and only if it is not divisible;
otherwise, it would have been split into two smaller buckets by Slice. Because $B$
and $B'$ are symmetric, all columns in $B$ and $B'$ have an equal size. Thus, $B$ is not
divisible if and only if $B'$ is not divisible. Hence, $B$ is a leaf if and only if $B'$ is a
leaf.

Next, consider that $B$ is not a leaf. Let $\{B_a, B_b\}$ and $\{B'_a, B'_b\}$ be the canonical
divisions of $B$ and $B'$, respectively. By Lemma 4, $B_a$ and $B'_a$ ($B_b$ and $B'_b$) must be
symmetric. Therefore, each child node of $B$ is symmetric to a child node of $B'$. By
induction, it can be shown that each leaf of $R_i$ is symmetric to a leaf of $R'_i$, and
vice versa. Hence, the lemma is proved. □

Proof of Lemma 6. Consider that we apply Assign on $T_1$ with the given $l$ value.
As shown in Figure 6, Assign first initializes a set $S_t = T_1$, and then iteratively
creates buckets using tuples in $S_t$. Let $U$ be the partition returned by Assign at
the end, $B_i$ the bucket constructed in the $i$-th iteration, and $S_i$ the set of tuples
in $S_t$ right before the $i$-th iteration. Next, assume that we run Assign on $T_2$. Let
$U'$ be the partition of $T_2$ generated by Assign, $B'_i$ the bucket created in the $i$-th
iteration, and $S'_i$ the set of tuples in $S_t$ prior to the $i$-th iteration. For simplicity,
we say that two buckets are siblings if and only if they have the same size and the
same signature. In the following, we will first prove a proposition:

- Proposition 4. For any $i \in [1, |U|]$, $B_i$ and $B'_i$ are siblings.

Consider that $i = 1$. By Lines 4-12 in Figure 6, the signature of $B_1$ should
contain the $\beta$ most frequent sensitive values in $S_t$, and $|B_1| = \alpha \cdot \beta$, where the
values of $\alpha$ and $\beta$ are decided by $|S_1|$ and the frequencies of sensitive values in $S_1$.
The above statement still holds if we change $B_1$ to $B'_1$, and $S_1$ to $S'_1$. Recall that
$S_1 = T_1 = \bigcup_{B \in U_1} B$ and $S'_1 = T_2 = \bigcup_{B \in U_2} B$. Since $U_1$ and $U_2$ are symmetric,
$S_1$ and $S'_1$ should have the same size, and include an identical multi-set of sensitive
values. Therefore, Assign should employ the same $\alpha$ and $\beta$ values to construct $B_1$
and $B'_1$. Thus, $B_1$ and $B'_1$ are siblings. Furthermore, because $S_2 = S_1 - B_1$ and
$S'_2 = S'_1 - B'_1$, $S_2$ and $S'_2$ should have an equal size, and contain the same multi-set
of sensitive values. In turn, this indicates that Assign should use identical $\alpha$ and
$\beta$ values to generate $B_2$ and $B'_2$, i.e., $B_2$ and $B'_2$ are also siblings. By induction
on $i$, it can be shown that Proposition 4 holds.

To prove the lemma, we regard $U$ and $U'$ as random variables, and show that
$\Pr\{U = U_1\} = \Pr\{U' = U_2\}$. Let us derive $\Pr\{U = U_1\}$ first. Recall that each
bucket in $U$ is constructed using tuples randomly selected from $T_1$. Therefore,
$\Pr\{U = U_1\}$ should equal $1/m$, where $m$ is the total number of possible ways to
assign the tuples in $T_1$ into the buckets in $U$. Assume that $T_1$ contains $w$ sensitive
values $v_1, v_2, \ldots, v_w$. Let $n_j$ ($j \in [1, w]$) be the frequency of $v_j$ in $T_1$. Let $d_{ij}$
denote the number of tuples in $B_i$ with sensitive value $v_j$. For simplicity, define
$0! = 1$. We have

\[ m = \frac{\prod_{j=1}^{w} (n_j!)}{\prod_{i=1}^{|U|} \prod_{j=1}^{w} (d_{ij}!)}. \tag{10} \]

Next, we will calculate $\Pr\{U' = U_2\}$. Since $T_1$ and $T_2$ contain the same multi-set
of sensitive values, for any $j \in [1, w]$, the frequency of $v_j$ in $T_2$ is also $n_j$.
Furthermore, because $B_i$ and $B'_i$ ($i \in [1, |U|]$) are siblings, there should exist $d_{ij}$
tuples in $B'_i$ that have sensitive value $v_j$. As a result, there are also $m$ distinct
ways to assign the tuples in $T_2$ to the buckets in $U'$. Therefore, $\Pr\{U' = U_2\} =
1/m = \Pr\{U = U_1\}$, which completes the proof. □
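The count $m$ can be sanity-checked numerically. The sketch below assumes that $m$ counts the assignments in which bucket $B_i$ receives exactly $d_{ij}$ tuples with sensitive value $v_j$, which gives the multinomial expression $\prod_j n_j! / \prod_i \prod_j d_{ij}!$; it compares that closed form with a brute-force enumeration. The input data and function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import factorial, prod


def count_by_formula(bucket_counts):
    """bucket_counts[i][v] is d_ij; n_j is obtained by summing d_ij over all buckets."""
    totals = Counter()
    for counts in bucket_counts:
        totals.update(counts)
    numerator = prod(factorial(n) for n in totals.values())
    denominator = prod(factorial(d) for counts in bucket_counts for d in counts.values())
    return numerator // denominator


def count_by_enumeration(tuple_values, bucket_counts):
    """Try every tuple-to-bucket assignment and keep those whose per-bucket counts match d_ij."""
    target = [Counter({v: d for v, d in counts.items() if d > 0}) for counts in bucket_counts]
    matches = 0
    for assignment in product(range(len(bucket_counts)), repeat=len(tuple_values)):
        seen = [Counter() for _ in bucket_counts]
        for value, bucket in zip(tuple_values, assignment):
            seen[bucket][value] += 1
        if seen == target:
            matches += 1
    return matches


# Six tuples (two per sensitive value) spread over three buckets of two tuples each.
values = ["flu", "flu", "dyspepsia", "dyspepsia", "gastritis", "gastritis"]
d = [
    {"flu": 1, "dyspepsia": 1},
    {"flu": 1, "gastritis": 1},
    {"dyspepsia": 1, "gastritis": 1},
]

print(count_by_formula(d), count_by_enumeration(values, d))  # prints: 8 8
```

Both counts depend only on the multi-set of sensitive values and on the $d_{ij}$, which is why sibling buckets built from $T_1$ and $T_2$ yield the same $m$.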

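Proof of Theorem 2. Let $T$ be any microdata, $l$ be any positive integer, and
$T^*$ be a possible output of $\mathcal{G}$. Let $E$ be any external source, and $S$ be the set of
possible microdata instances based on $E$. Let $o$ be any individual, $v$ be an arbitrary
sensitive value, and $S_{o,v}$ be the subset of $S$, such that each $T \in S_{o,v}$ associates $o$
with $v$. According to Proposition 1, we can prove Theorem 2 by showing that

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}} \le \frac{1}{l}. \]

We say that a bucket partition $U$ is a valid partition if $T^*$ can be decided by the
partition $U' = \mathcal{G}_B(U)$. Let $M$ be the set of all valid partitions, such that for each
$U \in M$, we have $\Pr\{\mathcal{G}_A(\hat{T}, l) = U\} > 0$ for some $\hat{T} \in S$. Then,

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}} = \frac{\sum_{\hat{T} \in S_{o,v}} \sum_{U \in M} \Pr\{\mathcal{G}_A(\hat{T}, l) = U\}}{\sum_{\hat{T} \in S} \sum_{U \in M} \Pr\{\mathcal{G}_A(\hat{T}, l) = U\}}. \]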

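We define a bucket partition $U \in M$ as a breaching partition if any QI-group
$G \in U$ contains a tuple $t$, such that $t[A^{id}] = o$ and $t[A^s] = v$. Observe that,
for any $T \in S_{o,v}$, if $U$ is not a breaching partition, then $\Pr\{\mathcal{G}_A(T, l) = U\} = 0$.
We divide $M$ into disjoint clusters, such that each cluster is a maximal subset of
symmetric bucket partitions in $M$. Let $n$ be the total number of clusters in $M$,
and $M_j$ ($j \in [1, n]$) be the $j$-th cluster. Let $M'_j$ be the set of breaching partitions
in $M_j$. We have

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}} = \frac{\sum_{j=1}^{n} \sum_{U \in M'_j} \sum_{\hat{T} \in S_{o,v}} \Pr\{\mathcal{G}_A(\hat{T}, l) = U\}}{\sum_{j=1}^{n} \sum_{U \in M_j} \sum_{\hat{T} \in S} \Pr\{\mathcal{G}_A(\hat{T}, l) = U\}}. \]

For simplicity, let $p(U, \hat{T})$ denote $\Pr\{\mathcal{G}_A(\hat{T}, l) = U\}$, and $q(M, S)$ denote
$\sum_{U \in M} \sum_{\hat{T} \in S} p(U, \hat{T})$. We will show that $q(M'_j, S_{o,v})/q(M_j, S) \le 1/l$ for any
$j \in [1, n]$. This will lead to

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{\mathcal{G}(\hat{T}, l) = T^*\}} = \frac{\sum_{j=1}^{n} q(M'_j, S_{o,v})}{\sum_{j=1}^{n} q(M_j, S)} \le \frac{1}{l}, \]

which proves the theorem.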

Without loss of generality, consider that $j = 1$. Let $U_k$ be the $k$-th ($k \in [1, |M_1|]$)
partition in $M_1$, and $T_k = \bigcup_{B \in U_k} B$. For any microdata $T$ different from $T_k$, we
have $p(U_k, T) = 0$, since $U_k$ is not a partition of $T$. Therefore, $T_k \in S$ should
hold; otherwise, $p(U_k, T) = 0$ for all $T \in S$, which contradicts the assumption that
$U_k \in M$. Thus, $\sum_{\hat{T} \in S} p(U_k, \hat{T}) = p(U_k, T_k)$. By our assumption on $\mathcal{G}_A$, for any
$k_1, k_2 \in [1, |M_1|]$, we have $p(U_{k_1}, T_{k_1}) = p(U_{k_2}, T_{k_2})$. Hence,

\[ q(M_1, S) = \sum_{k=1}^{|M_1|} \sum_{\hat{T} \in S} p(U_k, \hat{T}) = \sum_{k=1}^{|M_1|} p(U_k, T_k) = |M_1| \cdot p(U_k, T_k). \]

Similarly, it can be verified that $q(M'_1, S_{o,v}) = |M'_1| \cdot p(U_k, T_k)$. Therefore,
$q(M'_1, S_{o,v})/q(M_1, S) = |M'_1|/|M_1|$.

Next, we will derive the value of $|M_1|$. Let $U_s$ be any partition symmetric to $U_k$,
and $T_s = \bigcup_{B \in U_s} B$. Then, $T_s$ and $T_k$ should contain the same set of individuals.
Hence, $T_s \in S$. Since $U_s$ and $U_k$ are symmetric, $p(U_s, T_s) = p(U_k, T_k) > 0$ holds.
Therefore, $U_s$ is a valid partition. Let $U'_s = \mathcal{G}_B(U_s)$, and $U'_k = \mathcal{G}_B(U_k)$. By our
assumption on $\mathcal{G}_B$, $U'_s$ and $U'_k$ are symmetric. Observe that symmetric partitions are
isomorphic, and thus, they always lead to the same anonymization. Since $U'_k$ and
$f$ decide $T^*$, $U'_s$ and $f$ should also determine $T^*$, which indicates that $U_s \in M$. In
other words, any partition symmetric to $U_k$ should be contained in $M$. Consequently,
by the definition of $M_1$, $|M_1|$ equals the total number of partitions symmetric to
$U_k$.

By Definition 13, we can obtain any partition symmetric to $U_k$ by substituting
any buckets in $U_k$ with their symmetric counterparts. Let $B_i$ be the $i$-th ($i \in
[1, |U_k|]$) bucket in $U_k$, and $a_i$ be the number of buckets symmetric to $B_i$. Then,
$|M_1| = \prod_{i=1}^{|U_k|} a_i$. Without loss of generality, assume that $o$ appears in $B_1$. Among
all buckets symmetric to $B_1$, let $a'_1$ be the number of them that contain a tuple
$t$ with $t[A^{id}] = o$ and $t[A^s] = v$. We have $|M'_1| = a'_1 \cdot \prod_{i=2}^{|U_k|} a_i$. Therefore,
$|M'_1|/|M_1| = a'_1/a_1$.

Assume $B_1$ has a signature $V$ with $x$ sensitive values. If $v \notin V$, then $a'_1 = 0$.
Consider that $v \in V$. Recall that we can transform $B_1$ into any bucket symmetric
to $B_1$ by swapping the sensitive values between different columns of $B_1$. In total,
there are $x!$ distinct ways to assign $x$ sensitive values to the $x$ columns of $B_1$.
Because each of these assignments corresponds to a bucket symmetric to $B_1$, we have
$a_1 = x!$. Next, consider that we assign the $A^s$ value $v$ to the column in which $o$ appears.
The other $x - 1$ sensitive values can be assigned in $(x - 1)!$ different manners, i.e.,
$a'_1 = (x - 1)!$. Hence, $a'_1/a_1 = 1/x$. According to the way Assign constructs each
bucket, we have $x \ge l$. Therefore, $|M'_1|/|M_1| = a'_1/a_1 \le 1/l$, which completes the
proof. □
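The ratio $a'_1/a_1 = 1/x$ can be confirmed by enumerating every assignment of the $x$ signature values to the $x$ columns of a bucket. The following Python sketch does so for a hypothetical bucket with three columns; the column contents and the function name are illustrative assumptions.

```python
from itertools import permutations


def symmetric_bucket_counts(columns, signature, o, v):
    """Count buckets symmetric to the given one (a_1) and those pairing `o` with `v` (a'_1).

    `columns` is a list of sets of individuals; `signature` lists the distinct
    sensitive values, one per column.
    """
    a1 = 0
    a1_prime = 0
    for values in permutations(signature):
        a1 += 1
        for column, value in zip(columns, values):
            if o in column and value == v:
                a1_prime += 1
    return a1, a1_prime


cols = [{"Ann", "Bob"}, {"Cate", "Don"}, {"Ed", "Fred"}]
sig = ["flu", "dyspepsia", "gastritis"]

print(symmetric_bucket_counts(cols, sig, "Ann", "flu"))  # prints: (6, 2)
```

With $x = 3$ the sketch yields $a_1 = 3! = 6$ and $a'_1 = 2! = 2$, so the ratio is $1/3$.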

Proof of Theorem 3. Let $S$ be the set of possible microdata instances based on $E$,
and $v$ be an arbitrary sensitive value. Let $S_{o,v}$ be the subset of $S$, such that each
$T \in S_{o,v}$ involves $o$, and sets $v$ as the $A^s$ value of $o$. By Proposition 1, Theorem 3
holds if and only if

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{Hybrid(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{Hybrid(\hat{T}, l) = T^*\}} \le \frac{1}{l}. \tag{15} \]

Consider that we apply Hybrid on any $T \in S$, with the given $l$ value. Hybrid
first employs Tailor to obtain a partition $P$ of $T$. We define $P$ as the essential
partition of $T$, and use $G_j$ to denote the $j$-th ($j \in [1, |P|]$) QI-group in $P$. Then,
Hybrid invokes Ace to transform each $G_j \in P$ into a set $T_j^*$ of anonymized tuples.
We define the ordered set $\{T_1^*, T_2^*, \ldots, T_{|P|}^*\}$ as a decomposition of $P$. Since Ace is
a randomized algorithm, there may exist multiple decompositions of $P$. At last,
Hybrid returns the union $T^*$ of all $T_j^*$. We use $\gamma(P, T^*)$ to denote the probability
that Hybrid transforms $P$ into $T^*$.

Let $Q$ ($Q'$) be a set that includes the essential partition of any $T \in S$ ($T \in
S_{o,v}$). We divide $Q$ into several clusters, such that each cluster is a maximal set
of isomorphic partitions in $Q$. Let $n$ be the total number of clusters in $Q$, and $C_k$
($k \in [1, n]$) be the $k$-th cluster. Let $C'_k = C_k \cap Q'$. Then, we have

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{Hybrid(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{Hybrid(\hat{T}, l) = T^*\}} = \frac{\sum_{k=1}^{n} \sum_{P \in C'_k} \gamma(P, T^*)}{\sum_{k=1}^{n} \sum_{P \in C_k} \gamma(P, T^*)}. \tag{16} \]

We will prove that, for any $k \in [1, n]$,

\[ \frac{\sum_{P \in C'_k} \gamma(P, T^*)}{\sum_{P \in C_k} \gamma(P, T^*)} \le \frac{1}{l}. \tag{17} \]

This leads to

\[ \frac{\sum_{\hat{T} \in S_{o,v}} \Pr\{Hybrid(\hat{T}, l) = T^*\}}{\sum_{\hat{T} \in S} \Pr\{Hybrid(\hat{T}, l) = T^*\}} = \frac{\sum_{k=1}^{n} \sum_{P \in C'_k} \gamma(P, T^*)}{\sum_{k=1}^{n} \sum_{P \in C_k} \gamma(P, T^*)} \le \frac{\sum_{k=1}^{n} \sum_{P \in C_k} \gamma(P, T^*)/l}{\sum_{k=1}^{n} \sum_{P \in C_k} \gamma(P, T^*)} = \frac{1}{l}. \]

Without loss of generality, consider that $k = 1$. Let $P$ be an arbitrary partition
in $C_1$, and $G_j$ the $j$-th QI-group in $P$. Assume that $o$ is involved in $G_1$. Further
assume that, for any $P' \in C_1$, the $j$-th ($j \in [1, |P|]$) QI-group in $P'$ is isomorphic
to $G_j$. Then, for any $P' \in C_1$, $o$ should appear in the first QI-group of $P'$. We
split $C_1$ into sub-clusters, such that any two partitions in the same sub-cluster
coincide on all but the first QI-group. Let $n'$ be the number of sub-clusters
in $C_1$, $D_i$ ($i \in [1, n']$) be the $i$-th sub-cluster, and $D'_i = D_i \cap Q'$. To prove that
Equation 17 holds for $k = 1$, it suffices to show that, for any $i \in [1, n']$,

\[ \frac{\sum_{P \in D'_i} \gamma(P, T^*)}{\sum_{P \in D_i} \gamma(P, T^*)} \le \frac{1}{l}. \tag{18} \]

This is because, once the above inequality is established, we have

\[ \frac{\sum_{P \in C'_1} \gamma(P, T^*)}{\sum_{P \in C_1} \gamma(P, T^*)} = \frac{\sum_{i=1}^{n'} \sum_{P \in D'_i} \gamma(P, T^*)}{\sum_{i=1}^{n'} \sum_{P \in D_i} \gamma(P, T^*)} \le \frac{\sum_{i=1}^{n'} \sum_{P \in D_i} \gamma(P, T^*)/l}{\sum_{i=1}^{n'} \sum_{P \in D_i} \gamma(P, T^*)} = \frac{1}{l}. \tag{19} \]

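Assume, without loss of generality, that $i = 1$. Let $P_x$ ($x \in [1, |D_1|]$) be the $x$-th
partition in $D_1$, and $G_{xm}$ be the $m$-th ($m \in [1, |P_x|]$) QI-group in $P_x$. Let $S_d$ be
a set containing any decomposition of any $P_x \in D_1$, and $\Omega$ be the subset of $S_d$,
such that each decomposition $W \in \Omega$ leads to $T^*$, i.e., $\bigcup_{T_s^* \in W} T_s^* = T^*$. Let $W_j$
be the $j$-th decomposition in $\Omega$, and $T_{jm}^*$ the $m$-th set of anonymized tuples in $W_j$.
Observe that $|W_j| = |P_x|$ for any $j \in [1, |\Omega|]$ and any $x \in [1, |D_1|]$. By the definition
of $\gamma(P_x, T^*)$, we have

\[ \gamma(P_x, T^*) = \sum_{j=1}^{|\Omega|} \prod_{m=1}^{|P_x|} \Pr\{Ace(G_{xm}, l) = T_{jm}^*\}. \tag{20} \]

For simplicity, we denote $\prod_{m=1}^{|P_x|} \Pr\{Ace(G_{xm}, l) = T_{jm}^*\}$ as $p(P_x, W_j)$. Then,

\[ \frac{\sum_{P \in D'_1} \gamma(P, T^*)}{\sum_{P \in D_1} \gamma(P, T^*)} = \frac{\sum_{P_x \in D'_1} \sum_{j=1}^{|\Omega|} p(P_x, W_j)}{\sum_{P_x \in D_1} \sum_{j=1}^{|\Omega|} p(P_x, W_j)}. \tag{21} \]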
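To prove that Equation 18 is valid when $i = 1$, we will show that

\[ \frac{\sum_{P_x \in D'_1} p(P_x, W_j)}{\sum_{P_x \in D_1} p(P_x, W_j)} \le \frac{1}{l} \tag{22} \]

for any $j \in [1, |\Omega|]$. In particular, the above inequality ensures that

\[
\begin{aligned}
\frac{\sum_{P \in D'_1} \gamma(P, T^*)}{\sum_{P \in D_1} \gamma(P, T^*)} &= \frac{\sum_{P_x \in D'_1} \sum_{j=1}^{|\Omega|} p(P_x, W_j)}{\sum_{P_x \in D_1} \sum_{j=1}^{|\Omega|} p(P_x, W_j)} \quad \text{(by Equation 21)} \\
&\le \frac{\sum_{j=1}^{|\Omega|} \sum_{P_x \in D_1} p(P_x, W_j)/l}{\sum_{j=1}^{|\Omega|} \sum_{P_x \in D_1} p(P_x, W_j)} = \frac{1}{l}.
\end{aligned}
\tag{23}
\]

Let $q(G_{xm}, T_{jm}^*)$ denote $\Pr\{Ace(G_{xm}, l) = T_{jm}^*\}$. Recall that any two partitions
in $D_1$ coincide on all but the first QI-group. Therefore, given any $m \in [2, |P_x|]$ and
any $j \in [1, |\Omega|]$, the value of $q(G_{xm}, T_{jm}^*)$ is fixed for all $P_x \in D_1$. Let $r_j$ denote
$\prod_{m=2}^{|P_x|} q(G_{xm}, T_{jm}^*)$. Then,

\[
\begin{aligned}
p(P_x, W_j) &= \prod_{m=1}^{|P_x|} \Pr\{Ace(G_{xm}, l) = T_{jm}^*\} \\
&= \prod_{m=1}^{|P_x|} q(G_{xm}, T_{jm}^*) \\
&= r_j \cdot q(G_{x1}, T_{j1}^*).
\end{aligned}
\tag{24}
\]

Therefore, for any $j \in [1, |\Omega|]$,

\[ \frac{\sum_{P_x \in D'_1} p(P_x, W_j)}{\sum_{P_x \in D_1} p(P_x, W_j)} = \frac{\sum_{P_x \in D'_1} \big(r_j \cdot q(G_{x1}, T_{j1}^*)\big)}{\sum_{P_x \in D_1} \big(r_j \cdot q(G_{x1}, T_{j1}^*)\big)} = \frac{\sum_{P_x \in D'_1} q(G_{x1}, T_{j1}^*)}{\sum_{P_x \in D_1} q(G_{x1}, T_{j1}^*)}. \tag{25} \]

Consequently, to prove Equation 22, it suffices to show that

\[ \frac{\sum_{P_x \in D'_1} q(G_{x1}, T_{j1}^*)}{\sum_{P_x \in D_1} q(G_{x1}, T_{j1}^*)} \le \frac{1}{l} \tag{26} \]

for any $j \in [1, |\Omega|]$.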

Let $S_1$ be a set containing all $G_{x1}$ ($x \in [1, |D_1|]$), and $S'_1$ the maximal subset of
$S_1$, such that each $G \in S'_1$ contains a tuple $t$ with $t[A^{id}] = o$ and $t[A^s] = v$. Then,

\[ \frac{\sum_{P_x \in D'_1} q(G_{x1}, T_{j1}^*)}{\sum_{P_x \in D_1} q(G_{x1}, T_{j1}^*)} = \frac{\sum_{G \in S'_1} q(G, T_{j1}^*)}{\sum_{G \in S_1} q(G, T_{j1}^*)}. \tag{27} \]

By the definition of $D_1$, all QI-groups in $S_1$ are isomorphic. Therefore, all QI-groups
in $S_1$ have the same projection on the identifier and QI attributes. Denote
this projection as $E$. If we regard each QI-group $G_{x1} \in S_1$ as a tiny microdata
table, then $E$ can be deemed as an external source for $G_{x1}$. Let $S_2$ be the set of
all possible instances based on $E$. Let $S'_2$ be the set of instances in $S_2$ that contain
a tuple $t$ with $t[A^{id}] = o$ and $t[A^s] = v$. By Theorem 2, given $E$ as the external
source, any $T_{j1}^*$ ($j \in [1, |\Omega|]$) ensures that the disclosure risk of $o$ is at most $1/l$, i.e.,

\[ \frac{\sum_{G \in S'_2} q(G, T_{j1}^*)}{\sum_{G \in S_2} q(G, T_{j1}^*)} \le \frac{1}{l}. \tag{28} \]

By Equations 27 and 28, we can establish Equation 26 by showing that

\[ \frac{\sum_{G \in S'_1} q(G, T_{j1}^*)}{\sum_{G \in S_1} q(G, T_{j1}^*)} = \frac{\sum_{G \in S'_2} q(G, T_{j1}^*)}{\sum_{G \in S_2} q(G, T_{j1}^*)}. \tag{29} \]

For this purpose, it suffices to prove that $q(G, T_{j1}^*) = 0$ for any $G \in (S_1 - S_2) \cup
(S_2 - S_1)$ and any $G \in (S'_1 - S'_2) \cup (S'_2 - S'_1)$.

Since $S_2$ contains all microdata instances based on $E$, we have $G_{x1} \in S_2$ for
any $x \in [1, |D_1|]$. Therefore, $S_1 \subseteq S_2$, which indicates that $S'_1 \subseteq S'_2$. Hence,
$S_1 - S_2 = S'_1 - S'_2 = \emptyset$. Now consider any $P_x \in D_1$ ($x \in [1, |D_1|]$). Assume that
we construct a partition $P'_x$ from $P_x$, by replacing $G_{x1}$ with any of its isomorphic
counterparts. Then, $P'_x$ should be isomorphic to $P_x$. By Lemma 3, $P'_x$ is an essential
partition of some $T \in S$, i.e., $P'_x \in Q$. Since $P'_x$ and $P_x$ are isomorphic, and coincide
on all but the first QI-group, $P'_x \in D_1$ holds. In other words, for any $G$ isomorphic
to $G_{x1}$, there exists a partition $P_i \in D_1$ ($i \in [1, |D_1|]$), such that $G = G_{i1}$. Hence,
$S_1$ contains any QI-group isomorphic to $G_{x1}$.

Recall that any $T_{j1}^*$ ($j \in [1, |\Omega|]$) is an anonymization of a certain QI-group in $S_1$.
Since all QI-groups in $S_1$ are isomorphic, they contain the same multi-set of sensitive
values. This indicates that any $T_{j1}^*$ and any $G_{x1}$ ($x \in [1, |D_1|]$) have an identical
multi-set of sensitive values. Let $G'$ be a QI-group, such that $G' \in S_2 - S_1$. Then, $G'$
and $G_{x1}$ are not isomorphic, but involve the same set of individuals. Therefore, $G'$
and $G_{x1}$ must contain distinct multi-sets of $A^s$ values. Hence, for any $j \in [1, |\Omega|]$,
the multi-sets of sensitive values in $G'$ and $T_{j1}^*$ are different, i.e., $G'$ cannot be
anonymized to $T_{j1}^*$. Therefore, $q(G, T_{j1}^*) = 0$ for any $G \in S_2 - S_1$. Similarly, it
can be shown that $q(G, T_{j1}^*) = 0$ for any $G \in S'_2 - S'_1$. Thus, Equation 29 is valid.
In turn, this establishes Equations 26, 22, 18, and 17. Hence, the theorem is
proved. □

ELECTRONIC APPENDIX

In this electronic appendix, we exemplify an attack against the Mask algorithm
[Wong et al. 2007], which is designed under the credibility model [Wong et al.
2007]. Figure 1 illustrates the pseudo-code of Mask. The algorithm takes as input
a microdata table $T$, two positive integers $k$ and $l$, and a subset $V$ of the $A^s$ values.
It aims to ensure that, for any individual $o$ and any sensitive value $v \in V$, the
adversary would have at most $1/l$ posterior belief in the event that "$o$ appears in $T$
and has a sensitive value $v$". We will explain the details of Mask using an example.

Example 1. Suppose that we apply Mask on the microdata $T_g$ in Table I, by
setting $k = l = 2$ and $V = \{$dyspepsia$\}$. Mask first generates a $k$-anonymous
partition $P$ of $T_g$, using any of the existing $k$-anonymity algorithms (Line 1 in
Figure 1). Assume that $P$ contains three QI-groups, namely, {Ann, Bob}, {Cate,
Don}, and {Ed, Fred, Gill}. Next, Mask divides $P$ into two disjoint subsets $P_1$
and $P_2$ (Lines 2-6). In particular, $P_1$ contains all the QI-groups $G$ in $P$, such that
at least one sensitive value in $V$ appears more than $|G|/l$ times in $G$. Meanwhile,
$P_2 = P - P_1$. In our example, $P_1$ contains only one QI-group $G'$ = {Ann, Bob}.

After that, Mask randomly chooses a QI-group $G^+$ from $P_2$, and then modifies the
sensitive values in $G'$, so that $G'$ and $G^+$ have the same sensitive value distribution
(Lines 7-9). Assume that $G^+$ = {Cate, Don}. Then, $G'$ will be modified in a way,
such that 50% of tuples in $G'$ would have a sensitive value flu, and the other 50%
would have dyspepsia. Table II illustrates a possible result of the modification.
Finally, Mask returns the anonymization decided by the modified partition and an
anonymization function (say, the MBR function), as illustrated in Table III. □
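The steps of Example 1 can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the implementation of Wong et al.: the $k$-anonymous partition (Line 1 of Figure 1) is taken as given, the final generalization step (Line 10) is omitted, the way sensitive values are rewritten in `copy_distribution` is one possible reading of Line 9, and the `pick_donor` hook is a hypothetical addition that lets the example fix $G^+$ = {Cate, Don}. The sensitive values of Cate, Don, Ed, and Fred are illustrative placeholders.

```python
import random
from collections import Counter


def copy_distribution(group, donor, rng):
    """Reassign sensitive values in `group` so that their proportions match `donor`."""
    values = []
    for value, count in Counter(donor.values()).items():
        scaled, remainder = divmod(count * len(group), len(donor))
        if remainder:
            raise ValueError("donor distribution cannot be reproduced exactly")
        values.extend([value] * scaled)
    rng.shuffle(values)
    return dict(zip(group, values))


def mask_partition(partition, l, sensitive_subset, pick_donor=random.choice, rng=random):
    """Lines 2-9 of Figure 1; each QI-group is a dict individual -> sensitive value."""
    p1, p2 = [], []
    for group in partition:
        counts = Counter(group.values())
        # A group goes to P1 if some value in V appears more than |G|/l times.
        if any(counts[v] * l > len(group) for v in sensitive_subset):
            p1.append(group)
        else:
            p2.append(group)
    # Each group in P1 adopts the sensitive value distribution of a group chosen from P2.
    modified = [copy_distribution(group, pick_donor(p2), rng) for group in p1]
    return modified, p2


partition = [
    {"Ann": "dyspepsia", "Bob": "dyspepsia"},
    {"Cate": "flu", "Don": "dyspepsia"},                       # placeholder arrangement
    {"Ed": "gastritis", "Fred": "bronchitis", "Gill": "flu"},  # placeholder values
]
modified, untouched = mask_partition(
    partition, l=2, sensitive_subset={"dyspepsia"},
    pick_donor=lambda p2: p2[0],  # mirrors the assumption G+ = {Cate, Don}
    rng=random.Random(0),
)
print(modified, untouched)
```

After the call, {Ann, Bob} holds one flu and one dyspepsia, as in Example 1, while the two groups in $P_2$ are left unchanged.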

Algorithm Mask ($T$, $k$, $l$, $V$)

1. generate a $k$-anonymous partition $P$ of $T$
2. $P_1 = P_2 = \emptyset$
3. for each QI-group $G \in P$
4.     if one of the sensitive values in $V$ appears more than $|G|/l$ times in $G$
5.         insert $G$ into $P_1$
6.     else insert $G$ into $P_2$
7. for each QI-group $G' \in P_1$
8.     randomly choose a QI-group $G^+ \in P_2$
9.     modify the sensitive values in $G'$, such that the distribution of sensitive values in $G'$ becomes the same as that in $G^+$
10. return the anonymization decided by $P_1 \cup P_2$ and an anonymization function

Fig. 1. The Mask algorithm

Table I. Microdata $T_g$    Table II. Partition $P'$    Table III. Generalization $T_g^*$
[Table bodies are not recoverable; the surviving final rows read "Gill, 60, flu" (Tables I and II) and "[54, 60], flu" (Table III).]

As $T_g^*$ (in Table III) is produced by Mask with $l = 2$, under the credibility model,
an adversary has at most 1/2 posterior belief in the event that "Ann has dyspepsia
in the microdata". In the following, however, we will show that the posterior belief
of the adversary can be boosted to 5/8, if s/he has (i) the details of Mask, (ii) the
parameters $k$, $l$, and $V$ with which $T_g^*$ is computed, and (iii) an external source
that contains only the seven individuals in $T_g$.

Upon observing $T_g^*$, the adversary knows that $T_g^*$ is generated from a partition
$P$ with three QI-groups $G_1$ = {Ann, Bob}, $G_2$ = {Cate, Don}, and $G_3$ = {Ed,
Fred, Gill}. In addition, the adversary can infer that the sensitive values in $G_3$
have not been modified by Mask. Otherwise, the distribution of sensitive values in
$G_3$ must have been adopted from another QI-group in Table II, which is impossible since
neither $G_1$ nor $G_2$ has the same sensitive value distribution as $G_3$. On the other
hand, the sensitive values in $G_1$ and $G_2$ may or may not have been modified by
Mask. This leads to three different cases:

(1) Both G1 and G2 have been modified. This case is impossible; otherwise, the 
distributions of sensitive values in G1 and G2 would have been transformed to 
match that in G3, which would be the only QI-group in P that satisfies 2-diversity. 

(2) Exactly one of G1 and G2 has been modified. In this case, one of G1 and G2 must 
have contained two dyspepsia values before modification (since dyspepsia is the only 
value in V), while the other one has one flu and one dyspepsia. This results in 
4 possible microdata instances, 3 of which assign dyspepsia to Ann. 

(3) Neither G1 nor G2 has been modified. This leads to 4 possible microdata 
instances, 2 of which associate Ann with dyspepsia. 
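
The counts in cases (2) and (3) can be checked by brute-force enumeration. The following Python sketch is an illustration only. It assumes, as the case analysis implies, that the published table shows one flu and one dyspepsia in each of G1 and G2, and that every candidate instance is treated as equally likely. The constant and function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

FLU, DYS = "flu", "dyspepsia"

# A group is a pair of sensitive values: (Ann, Bob) for G1, (Cate, Don) for G2.
# An unmodified group already had one flu and one dyspepsia, in either order.
UNMODIFIED = [(FLU, DYS), (DYS, FLU)]
# A modified group must have violated 2-diversity w.r.t. V = {dyspepsia},
# i.e. both of its members had dyspepsia before modification.
MODIFIED = [(DYS, DYS)]


def candidate_instances():
    """Yield every (G1, G2) assignment consistent with cases (2) and (3)."""
    cases = [
        (MODIFIED, UNMODIFIED),    # case (2): only G1 was modified
        (UNMODIFIED, MODIFIED),    # case (2): only G2 was modified
        (UNMODIFIED, UNMODIFIED),  # case (3): neither was modified
    ]
    for g1_options, g2_options in cases:
        yield from product(g1_options, g2_options)


instances = list(candidate_instances())
ann_dyspepsia = [inst for inst in instances if inst[0][0] == DYS]
belief = Fraction(len(ann_dyspepsia), len(instances))
print(len(instances), len(ann_dyspepsia), belief)  # prints: 8 5 5/8
```

Case (1) contributes no instances, so it does not appear in the enumeration.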
In summary, from the adversary's perspective, there exist 8 possible microdata 
instances that can be generalized into T*9, among which 5 instances associate Ann 
with dyspepsia. Therefore, the adversary has 5/8 posterior belief in the event that 
"Ann has dyspepsia". 
1. INTRODUCTION 

Privacy protection is highly important in the publication of sensitive personal 
information (referred to as microdata), such as census data and medical records. A 
common practice in anonymization is to remove the identifiers (e.g., social security 